What do the constraints "Rah" and "Ral" mean in extended inline assembly? - gcc

This question is inspired by a question asked by someone on another forum. In the following code, what do the extended inline assembly constraints Rah and Ral mean? I haven't seen these before:
#include<stdint.h>
void tty_write_char(uint8_t inchar, uint8_t page_num, uint8_t fg_color)
{
asm (
"int $0x10"
:
: "b" ((uint16_t)page_num<<8 | fg_color),
"Rah"((uint8_t)0x0e), "Ral"(inchar));
}
void tty_write_string(const char *string, uint8_t page_num, uint8_t fg_color)
{
while (*string)
tty_write_char(*string++, page_num, fg_color);
}
/* Use the BIOS to print the first command line argument to the console */
int main(int argc, char *argv[])
{
if (argc > 1)
tty_write_string(argv[1], 0, 0);
return 0;
}
In particular, I am asking about the use of Rah and Ral as constraints in this code:
asm (
"int $0x10"
:
: "b" ((uint16_t)page_num<<8 | fg_color),
"Rah"((uint8_t)0x0e), "Ral"(inchar));
The GCC documentation doesn't list an l or h constraint among either the simple constraints or the x86 machine constraints. R is any legacy register and a is the AX/EAX/RAX register.
What am I not understanding?

What you are looking at is code that is intended to be run in real mode on an x86 based PC with a BIOS. Int 0x10 is a BIOS service that has the ability to write to the console. In particular Int 0x10/AH=0x0e is to write a single character to the TTY (terminal).
That in itself doesn't explain what the constraints mean. To understand the constraints Rah and Ral you have to understand that this code isn't being compiled by a standard version of GCC/CLANG. It is being compiled by a GCC port called ia16-gcc. It is a special port that targets 8086/80186 and 80286 and compatible processors. It doesn't generate 386 instructions or use 32-bit registers in code generation. This experimental version of GCC is to target 16-bit environments like DOS (FreeDOS, MSDOS), and ELKS.
The documentation for ia16-gcc is hard to find online in HTML format but I have produced a copy for the recent GCC 6.3.0 versions of the documentation on GitHub. The documentation was produced by building ia16-gcc from source and using make to generate the HTML. If you review the machine constraints for Intel IA-16—config/ia16 you should now be able to see what is going on:
Ral The al register.
Rah The ah register.
This version of GCC no longer understands the R constraint by itself. The inline assembly you are looking at matches the parameters for Int 0x10/AH=0x0e:
VIDEO - TELETYPE OUTPUT
AH = 0Eh
AL = character to write
BH = page number
BL = foreground color (graphics modes only)
Return:
Nothing
Desc: Display a character on the screen, advancing the cursor
and scrolling the screen as necessary
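As a rough illustration of the register layout above, the following sketch shows how the "b" operand in the question packs the page number into BH and the color into BL. The helper names pack_bx and pack_ax are hypothetical and do not appear in the original code. pack_ax shows the analogous AH:AL packing you would need if you passed a single 16-bit "a" operand on a compiler that has no Rah and Ral constraints.

```c
#include <stdint.h>

/* Hypothetical helper: BH = page number, BL = foreground color,
 * the same expression the question uses for its "b" operand. */
uint16_t pack_bx(uint8_t page_num, uint8_t fg_color)
{
    return (uint16_t)((uint16_t)page_num << 8 | fg_color);
}

/* Hypothetical helper: AH = BIOS function number, AL = character.
 * Assumption: this is only an illustration of the byte layout of AX. */
uint16_t pack_ax(uint8_t function, uint8_t ch)
{
    return (uint16_t)((uint16_t)function << 8 | ch);
}
```

With these, pack_ax(0x0e, inchar) yields a value whose high byte is the teletype function number and whose low byte is the character.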
Other Information
The documentation does list all the constraints that are available for the IA16 target:
Intel IA-16—config/ia16/constraints.md
a
The ax register. Note that for a byte operand,
this constraint means that the operand can go into either al or ah.
b
The bx register.
c
The cx register.
d
The dx register.
S
The si register.
D
The di register.
Ral
The al register.
Rah
The ah register.
Rcl
The cl register.
Rbp
The bp register.
Rds
The ds register.
q
Any 8-bit register.
T
Any general or segment register.
A
The dx:ax register pair.
j
The bx:dx register pair.
l
The lower half of pairs of 8-bit registers.
u
The upper half of pairs of 8-bit registers.
k
Any 32-bit register group with access to the two lower bytes.
x
The si and di registers.
w
The bx and bp registers.
B
The bx, si, di and bp registers.
e
The es register.
Q
Any available segment register—either ds or es (unless one or both have been fixed).
Z
The constant 0.
P1
The constant 1.
M1
The constant -1.
Um
The constant -256.
Lbm
The constant 255.
Lor
Constants 128 … 254.
Lom
Constants 1 … 254.
Lar
Constants -255 … -129.
Lam
Constants -255 … -2.
Uo
Constants 0xXX00 except -256.
Ua
Constants 0xXXFF.
Ish
A constant usable as a shift count.
Iaa
A constant multiplier for the aad instruction.
Ipu
A constant usable with the push instruction.
Imu
A constant usable with the imul instruction except 257.
I11
The constant 257.
N
Unsigned 8-bit integer constant (for in and out instructions).
There are many new constraints and some repurposed ones.
In particular the a constraint for the AX register doesn't work like other versions of GCC that target 32-bit and 64-bit code. The compiler is free to choose either AH or AL with the a constraint if the values being passed are 8 bit values. This means it is possible for the a constraint to appear twice in an extended inline assembly statement.
You could have compiled your code to a DOS EXE with this command:
ia16-elf-gcc -mcmodel=small -mregparmcall -march=i186 \
-Wall -Wextra -std=gnu99 -O3 int10h.c -o int10h.exe
This targets the 80186. You can generate 8086-compatible code by omitting the -march=i186 option. The generated code for main would look something like:
00000000 <main>:
0: 83 f8 01 cmp ax,0x1
3: 7e 1d jle 22 <tty_write_string+0xa>
5: 56 push si
6: 89 d3 mov bx,dx
8: 8b 77 02 mov si,WORD PTR [bx+0x2]
b: 8a 04 mov al,BYTE PTR [si]
d: 20 c0 and al,al
f: 74 0d je 1e <tty_write_string+0x6>
11: 31 db xor bx,bx
13: b4 0e mov ah,0xe
15: 46 inc si
16: cd 10 int 0x10
18: 8a 04 mov al,BYTE PTR [si]
1a: 20 c0 and al,al
1c: 75 f7 jne 15 <main+0x15>
1e: 31 c0 xor ax,ax
20: 5e pop si
21: c3 ret
22: 31 c0 xor ax,ax
24: c3 ret
When run with the command line int10h.exe "Hello, world!", the program should print:
Hello, world!
Special Note: The IA16 port of GCC is very experimental and does have some code generation bugs especially when higher optimization levels are used. I wouldn't use it for mission critical applications at this point in time.

Related

Outputting Az-Za in MS-Debug

I'm currently struggling to output AzByCx in MS-DEBUG since I don't know much about it.
Included here are the basic commands that my teacher sent to us.
I can output A-Z easily, but not AzByCx. There isn't much of a detailed tutorial online, so I came here to ask.
-e 200 "Az"
-a
1BD8:0100 mov bx, word [200]
1BD8:0104 mov cx, 3
1BD8:0107 mov ah, 2
1BD8:0109 mov dl, bl
1BD8:010B int 21
1BD8:010D mov dl, bh
1BD8:010F int 21
1BD8:0111 add bx, FF01
1BD8:0115 loop 109
1BD8:0117 mov dl, 0D
1BD8:0119 int 21
1BD8:011B mov dl, 0A
1BD8:011D int 21
1BD8:011F mov ax, 4C00
1BD8:0122 int 21
1BD8:0124
-g
AzByCx
What does this do?
initialises word at ds:200h to the string "Az" (little endian low byte is "A", high byte is "z")
use A command to assemble the following program:
load from memory into register bx (to initialise the register without having to look up the numeric ASCII codepoints)
set up cx as loop counter for three iterations
set up register ah for the interrupt 21h service 02h (display character in dl)
display character read from bl (low byte of bx)
display character from bh (high byte of bx)
add 1 to bl and -1 (in two's complement resulting in 0FFh) to bh which increments the codepoint in bl and decrements the one in bh
loop back to display subsequent letters
display a linebreak (codepoints 13 and 10, or 0Dh 0Ah) to make the output easier to read
terminate program
use G command to run the program
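The trick in the add bx, FF01 step can be checked outside of DEBUG. The following is a minimal C sketch of the same arithmetic, assuming a 16-bit value stands in for bx. The function name az_pairs is hypothetical. Adding 0xFF01 with 16-bit wraparound increments the low byte and decrements the high byte in one operation.

```c
#include <stdint.h>

/* Hypothetical model of the DEBUG program's loop.
 * Writes `pairs` two-letter pairs into out, which must hold
 * 2 * pairs + 1 bytes. */
void az_pairs(char *out, int pairs)
{
    /* Low byte 'A', high byte 'z', like the word stored at ds:200h. */
    uint16_t bx = (uint16_t)('z' << 8 | 'A');

    for (int i = 0; i < pairs; i++) {
        out[2 * i]     = (char)(bx & 0xFF);   /* mov dl, bl */
        out[2 * i + 1] = (char)(bx >> 8);     /* mov dl, bh */
        bx = (uint16_t)(bx + 0xFF01);         /* add bx, FF01 */
    }
    out[2 * pairs] = '\0';
}
```

Calling az_pairs with a 7-byte buffer and pairs set to 3 fills the buffer with the same six letters the DEBUG session prints.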
Microsoft DEBUG.EXE can indeed be used to write a simple program in Intel-syntax assembly, see this example. Replace the string definition db "Hello world$" with db "AzByCx$" and you're done.
However, this is definitely not a good way to learn assembly language.
Find a better teacher.

Understanding Hello World's lea macOS x86-64 [duplicate]

Consider the following variable reference in x64 Intel assembly, where the variable a is declared in the .data section:
mov eax, dword ptr [rip + _a]
I have trouble understanding how this variable reference works. Since a is a symbol corresponding to the runtime address of the variable (with relocation), how can [rip + _a] dereference the correct memory location of a? Indeed, rip holds the address of the current instruction, which is a large positive integer, so the addition results in an incorrect address of a?
Conversely, if I use x86 syntax (which is very intuitive):
mov eax, dword ptr [_a]
, I get the following error: 32-bit absolute addressing is not supported in 64-bit mode.
Any explanation?
int a = 5;

int main() {
    int b = a;
    return b;
}
Compilation: gcc -S -masm=intel abs_ref.c -o abs_ref:
    .section __TEXT,__text,regular,pure_instructions
    .build_version macos, 10, 14
    .intel_syntax noprefix
    .globl _main ## -- Begin function main
    .p2align 4, 0x90
_main: ## @main
    .cfi_startproc
## %bb.0:
    push rbp
    .cfi_def_cfa_offset 16
    .cfi_offset rbp, -16
    mov rbp, rsp
    .cfi_def_cfa_register rbp
    mov dword ptr [rbp - 4], 0
    mov eax, dword ptr [rip + _a]
    mov dword ptr [rbp - 8], eax
    mov eax, dword ptr [rbp - 8]
    pop rbp
    ret
    .cfi_endproc
## -- End function
    .section __DATA,__data
    .globl _a ## @a
    .p2align 2
_a:
    .long 5 ## 0x5

.subsections_via_symbols
GAS syntax for RIP-relative addressing looks like symbol + current_address (RIP), but it actually means symbol with respect to RIP.
There's an inconsistency with numeric literals:
[rip + 10] or AT&T 10(%rip) means 10 bytes past the end of this instruction
[rip + a] or AT&T a(%rip) means to calculate a rel32 displacement to reach a, not RIP + symbol value. (The GAS manual documents this special interpretation)
[a] or AT&T a is an absolute address, using a disp32 addressing mode. This isn't supported on OS X, where the image base address is always outside the low 32 bits. (Or for mov to/from al/ax/eax/rax, a 64-bit absolute moffs encoding is available, but you don't want that).
Linux position-dependent executables do put static code/data in the low 31 bits (2GiB) of virtual address space, so you can/should use mov edi, sym there, but on OS X your best option is lea rdi, [sym+RIP] if you need an address in a register. Unable to move variables in .data to registers with Mac x86 Assembly.
(In OS X, the convention is that C variable/function names are prepended with _ in asm. In hand-written asm you don't have to do this for symbols you don't want to access from C.)
NASM is much less confusing in this respect:
[rel a] means RIP-relative addressing for [a]
[abs a] means [disp32].
default rel or default abs sets what's used for [a]. The default is (unfortunately) default abs, so you almost always want a default rel.
Example with .set symbol values vs. a label
.intel_syntax noprefix
mov dword ptr [sym + rip], 0x11111111
sym:
.equ x, 8
inc byte ptr [x + rip]
.set y, 32
inc byte ptr [y + rip]
.set z, sym
inc byte ptr [z + rip]
gcc -nostdlib foo.s && objdump -drwC -Mintel a.out (on Linux; I don't have OS X):
0000000000001000 <sym-0xa>:
1000: c7 05 00 00 00 00 11 11 11 11 mov DWORD PTR [rip+0x0],0x11111111 # 100a <sym> # rel32 = 0; it's from the end of the instruction not the end of the rel32 or anywhere else.
000000000000100a <sym>:
100a: fe 05 08 00 00 00 inc BYTE PTR [rip+0x8] # 1018 <sym+0xe>
1010: fe 05 20 00 00 00 inc BYTE PTR [rip+0x20] # 1036 <sym+0x2c>
1016: fe 05 ee ff ff ff inc BYTE PTR [rip+0xffffffffffffffee] # 100a <sym>
(Disassembling the .o with objdump -dr will show you that there aren't any relocations for the linker to fill in, they were all done at assemble time.)
Notice that only .set z, sym resulted in a with-respect-to calculation. x and y were originally defined from plain numeric literals, not labels, so even though the instruction itself used [x + RIP], we still got [RIP + 8].
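The arithmetic behind the disassembly above can be sketched in C. This is only an illustration of the address calculation, not of how GAS is implemented, and the function names rip_target and rel32_for are hypothetical. The key point is that RIP holds the address of the end of the current instruction when the displacement is applied.

```c
#include <stdint.h>

/* Address that a RIP-relative operand refers to.
 * RIP already points past the instruction, so add its length first. */
uint64_t rip_target(uint64_t insn_addr, unsigned insn_len, int32_t rel32)
{
    return insn_addr + insn_len + (uint64_t)(int64_t)rel32;
}

/* Displacement stored for a label operand such as [sym + rip]:
 * the distance from the end of the instruction to the label. */
int32_t rel32_for(uint64_t insn_addr, unsigned insn_len, uint64_t target)
{
    return (int32_t)(int64_t)(target - (insn_addr + insn_len));
}
```

For the last inc in the listing (address 0x1016, 6 bytes long, target sym at 0x100a), rel32_for gives -18, which is the 0xffffffffffffffee shown by objdump. For the numeric-literal case at 0x100a with displacement 8, rip_target gives 0x1018.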
(Linux non-PIE only): To address absolute 8 wrt. RIP, you'd need AT&T syntax incb 8-.(%rip). I don't know how to write that in GAS intel_syntax; [8 - . + RIP] is rejected with Error: invalid operands (*ABS* and .text sections) for '-'.
Of course you can't do that anyway on OS X, except maybe for absolute addresses that are in range of the image base. But there's probably no relocation that can hold the 64-bit absolute address to be calculated for a 32-bit rel32.
Related:
How to load address of function or label into register AT&T version of this
32-bit absolute addresses no longer allowed in x86-64 Linux? PIE vs. non-PIE executables, when you have to use position-independent code.

Do C/C++ compilers such as GCC generally optimize modulo by a constant power of 2?

Let's say I have something like:
#define SIZE 32
/* ... */
unsigned x;
/* ... */
x %= SIZE;
Would the x % 32 generally be reduced to x & 31 by most C/C++ compilers such as GCC?
Yes, any respectable compiler should perform this optimization. Specifically, an a % X operation, where a is unsigned and X is a constant power of two, will become the equivalent a & (X-1) operation.
GCC will even do this with optimizations turned off:
Example (gcc -c -O0 version 3.4.4 on Cygwin):
unsigned int test(unsigned int a) {
return a % 32;
}
Result (objdump -d):
00000000 <_test>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 8b 45 08 mov 0x8(%ebp),%eax
6: 5d pop %ebp
7: 83 e0 1f and $0x1f,%eax ;; Here
a: c3 ret
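The equivalence can be checked with a small sketch. The function names mod_pow2 and signed_rem32 are hypothetical. Note the assumption that the operand is unsigned, as in the question. For a signed operand, C99 and later define the remainder to take the sign of the dividend, so a bare mask is not equivalent and the compiler has to emit a slightly longer sequence.

```c
/* For unsigned x and a power-of-two n, x % n equals x & (n - 1).
 * Assumption: the caller guarantees n is a nonzero power of two. */
unsigned mod_pow2(unsigned x, unsigned n)
{
    return x & (n - 1u);
}

/* Signed remainder keeps the sign of the dividend (C99 and later),
 * so this cannot be compiled to a single AND. */
int signed_rem32(int x)
{
    return x % 32;
}
```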

Arithmetic shift using intel intrinsics

I have a set of bits, for example: 1000 0000 0000 0000 which is 16 bits and therefore a short. I would like to use an arithmetic shift so that I use the MSB to assign the rest of the bits:
1111 1111 1111 1111
and if I started off with 0000 0000 0000 0000:
after arithmetically shifting I would still have: 0000 0000 0000 0000
How can I do this without resorting to assembly? I looked at the intel intrinsics guide and it looks like I need to do this using AVX extensions but they look at data types larger than my short.
As mattnewport states in his answer, C code can do the job efficiently though with the possibility of 'implementation defined' behavior. This answer shows how to avoid the implementation defined behavior while preserving efficient code generation.
Because the question is about shifting a 16-bit operand, the concern over the implementation defined decision whether to sign extend or zero fill can be avoided by first sign extending the operand to 32-bits. Then the 32-bit value can be right shifted as unsigned, and finally truncated back to 16-bits.
Mattnewport's code actually sign extends the 16-bit operand to an int (32-bit or 64-bit depending on the compiler model) before shifting. This is because the language specification (C99 6.5.7 bitwise shift operators) requires a first step: "integer promotions are performed on each of the operands". Likewise, the result of mattnewport's code is an int because "The type of the result is that of the promoted left operand". Because of this, the code variation that avoids implementation defined behavior generates the same number of instructions as mattnewport's original code.
To avoid implementation defined behavior, the implicit promotion to signed int is replaced with an explicit promotion to unsigned int. This eliminates any possibility of implementation defined behavior while maintaining the same code efficiency.
The idea can be extended to cover 32-bit operands and works efficiently when 64-bit native integer support is present. Here is an example:
// use standard 'll' for long long print format
#define __USE_MINGW_ANSI_STDIO 1
#include <stdio.h>
#include <stdint.h>
// Code provided by mattnewport
int16_t aShiftRight16x (int16_t val, int count)
{
return val >> count;
}
// This variation avoids implementation defined behavior
int16_t aShiftRight16y (int16_t val, int count)
{
uint32_t uintVal = val;
uint32_t uintResult = uintVal >> count;
return (int16_t) uintResult;
}
// A 32-bit arithmetic right shift without implementation defined behavior
int32_t aShiftRight32 (int32_t val, int count)
{
uint64_t uint64Val = val;
uint64_t uint64Result = uint64Val >> count;
return (int32_t) uint64Result;
}
int main (void)
{
int16_t val16 = 0x8000;
int32_t val32 = 0x80000000;
int count;
for (count = 0; count <= 15; count++)
printf ("%04hX %04hX %08X\n", aShiftRight16x (val16, count),
aShiftRight16y (val16, count),
aShiftRight32 (val32, count));
return 0;
}
Here is gcc 4.8.1 x64 code generation:
0000000000000030 <aShiftRight16x>:
30: 0f bf c1 movsx eax,cx
33: 89 d1 mov ecx,edx
35: d3 f8 sar eax,cl
37: c3 ret
0000000000000040 <aShiftRight16y>:
40: 0f bf c1 movsx eax,cx
43: 89 d1 mov ecx,edx
45: d3 e8 shr eax,cl
47: c3 ret
0000000000000050 <aShiftRight32>:
50: 48 63 c1 movsxd rax,ecx
53: 89 d1 mov ecx,edx
55: 48 d3 e8 shr rax,cl
58: c3 ret
Here is MS visual studio x64 code generation:
aShiftRight16x:
00: 0F BF C1 movsx eax,cx
03: 8B CA mov ecx,edx
05: D3 F8 sar eax,cl
07: C3 ret
aShiftRight16y:
10: 0F BF C1 movsx eax,cx
13: 8B CA mov ecx,edx
15: D3 E8 shr eax,cl
17: C3 ret
aShiftRight32:
20: 48 63 C1 movsxd rax,ecx
23: 8B CA mov ecx,edx
25: 48 D3 E8 shr rax,cl
28: C3 ret
program output:
8000 8000 80000000
C000 C000 C0000000
E000 E000 E0000000
F000 F000 F0000000
F800 F800 F8000000
FC00 FC00 FC000000
FE00 FE00 FE000000
FF00 FF00 FF000000
FF80 FF80 FF800000
FFC0 FFC0 FFC00000
FFE0 FFE0 FFE00000
FFF0 FFF0 FFF00000
FFF8 FFF8 FFF80000
FFFC FFFC FFFC0000
FFFE FFFE FFFE0000
FFFF FFFF FFFF0000
I'm not sure why you are looking for intrinsics for this. Why not just use ordinary C++ right shift? This behavior is implementation defined but AFAIK on Intel platforms it will always sign extend.
int16_t val = 1 << 15; // 1000 0000 0000 0000
int16_t shiftVal = val >> 15; // 1111 1111 1111 1111
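For comparison, here is a different technique from the ones shown above: a sketch of an arithmetic right shift that only ever shifts non-negative values, so it does not rely on how the implementation shifts negative numbers. The function name asr16 is hypothetical, and the count is assumed to be in the range 0 to 15. It may not compile to a single sar instruction, so treat it as an illustration rather than a replacement.

```c
#include <stdint.h>

/* Arithmetic shift right of a 16-bit value, count assumed in 0..15.
 * For negative val, -1 - val is non-negative (it equals ~val in two's
 * complement), so the shift itself is fully defined by the standard. */
int16_t asr16(int16_t val, unsigned count)
{
    if (val >= 0)
        return (int16_t)(val >> count);
    return (int16_t)(-1 - ((-1 - val) >> count));
}
```

With the question's input, asr16(INT16_MIN, 15) returns -1 (all bits set) and asr16(0, 15) returns 0.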

Fixed point math with ARM Cortex-M4 and gcc compiler

I'm using Freescale Kinetis K60 and using the CodeWarrior IDE (which I believe uses GCC for the compiler).
I want to multiply two 32 bit numbers (which results in a 64 bit number) and only retain the upper 32 bits.
I think the correct assembly instruction for the ARM Cortex-M4 is the SMMUL instruction. I would prefer to access this instruction from C code rather than assembly. How do I do this?
I imagine the code would ideally be something like this:
int a,b,c;
a = 1073741824; // 0x40000000 = 0.5 as a D0 fixed point number
b = 1073741824; // 0x40000000 = 0.5 as a D0 fixed point number
c = ((long long)a*b) >> 31; // 31 because there are two sign bits after the multiplication
// so I can throw away the most significant bit
When I try this in CodeWarrior, I get the correct result for c (536870912 = 0.25 as a D0 FP number). I don't see the SMMUL instruction anywhere and the multiply is 3 instructions (UMULL, MLA, and MLA -- I don't understand why it is using an unsigned multiply, but that is another question). I have also tried a right shift of 32 since that might make more sense for the SMMUL instruction, but that doesn't do anything different.
The problem you get with optimizing that code is:
08000328 <mul_test01>:
8000328: f04f 5000 mov.w r0, #536870912 ; 0x20000000
800032c: 4770 bx lr
800032e: bf00 nop
Your code doesn't do anything at run time, so the optimizer can just compute the final answer.
this:
.thumb_func
.globl mul_test02
mul_test02:
smull r2,r3,r0,r1
mov r0,r3
bx lr
called with this:
c = mul_test02(0x40000000,0x40000000);
gives 0x10000000
UMULL gives the same result because you are using positive numbers. The operands and results are all positive, so it doesn't get into the signed/unsigned differences.
Hmm, well you got me on this one. I would read your code as telling the compiler to promote the multiply to 64 bits. smull takes two 32-bit operands and gives a 64-bit result, which is not what your code is asking for... but both gcc and clang used smull anyway. Even when I left it as an uncalled function, so they didn't know at compile time that the operands had no significant digits above 32 bits, they still used smull.
Perhaps the shift was the reason.
Yup, that was it..
int mul_test04 ( int a, int b )
{
int c;
c = ((long long)a*b) >> 31;
return(c);
}
gives
both gcc and clang (well clang recycles r0 and r1 instead of using r2 and r3)
08000340 <mul_test04>:
8000340: fb81 2300 smull r2, r3, r1, r0
8000344: 0fd0 lsrs r0, r2, #31
8000346: ea40 0043 orr.w r0, r0, r3, lsl #1
800034a: 4770 bx lr
but this
int mul_test04 ( int a, int b )
{
int c;
c = ((long long)a*b);
return(c);
}
gives this
gcc:
08000340 <mul_test04>:
8000340: fb00 f001 mul.w r0, r0, r1
8000344: 4770 bx lr
8000346: bf00 nop
clang:
0800048c <mul_test04>:
800048c: 4348 muls r0, r1
800048e: 4770 bx lr
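So with the bit shift the compilers realize that you need the upper portion of the 64-bit result, and because both operands are really 32-bit values sign-extended to 64 bits they can ignore the upper halves of the operands, which means smull can be used. Without the shift only the low 32 bits of the product are kept, so a plain mul is enough.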
Now if you do this:
int mul_test04 ( int a, int b )
{
int c;
c = ((long long)a*b) >> 32;
return(c);
}
both compilers get even smarter, in particular clang:
0800048c <mul_test04>:
800048c: fb81 1000 smull r1, r0, r1, r0
8000490: 4770 bx lr
gcc:
08000340 <mul_test04>:
8000340: fb81 0100 smull r0, r1, r1, r0
8000344: 4608 mov r0, r1
8000346: 4770 bx lr
I can see 0x40000000 as a fixed-point number where you are keeping track of the binary point yourself, and that point is at a fixed location. 0x20000000 would make sense as the answer. I can't yet decide if that 31-bit shift works universally or just for this one case.
A complete example used for the above is here
https://github.com/dwelch67/stm32vld/tree/master/stm32f4d/sample01
and I did run it on an stm32f4 to verify it works and the results.
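The two shift amounts discussed above can be put side by side in a short sketch. The function names q31_mul and mul_hi32 are hypothetical. Two assumptions apply: right-shifting a negative 64-bit value is implementation-defined in C (GCC and Clang shift arithmetically), and the single input pair where both operands are 0x80000000 overflows the 32-bit result in q31_mul and is not handled here.

```c
#include <stdint.h>

/* Fixed-point multiply with 31 fractional bits: the 64-bit product has
 * two sign bits, so shifting by 31 drops one of them. */
int32_t q31_mul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 31);
}

/* Upper 32 bits of the 64-bit product, which is what a shift by 32
 * (and the smull followed by mov shown above) produces. */
int32_t mul_hi32(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}
```

With both inputs set to 0x40000000, q31_mul returns 0x20000000 and mul_hi32 returns 0x10000000, matching the two results reported above.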
EDIT:
If you pass the parameters into the function instead of hardcoding them within the function:
int myfun ( int a, int b )
{
return(a+b);
}
The compiler is forced to make runtime code instead of optimizing the answer at compile time.
Now if you call that function from another function with hardcoded numbers:
...
c=myfun(0x1234,0x5678);
...
In this calling function the compiler may choose to compute the answer and just place it there at compile time. If the myfun() function is global (not declared as static), the compiler doesn't know whether some other code to be linked later will use it. So even if it optimizes the call in this file down to a precomputed answer, it still has to produce the actual function and leave it in the object for code in other files to call, and you can still examine what the compiler/optimizer does with that C code. Unless you use something like llvm, where you can optimize the whole project (across files), external code calling this function will use the real function and not a compile-time computed answer.
both gcc and clang did what I am describing, left runtime code for the function as a global function, but within the file it computed the answer at compile time and placed the hardcoded answer in the code instead of calling the function:
int mul_test04 ( int a, int b )
{
int c;
c = ((long long)a*b) >> 31;
return(c);
}
in another function in the same file:
hexstring(mul_test04(0x40000000,0x40000000),1);
The function itself is implemented in the code:
0800048c <mul_test04>:
800048c: fb81 1000 smull r1, r0, r1, r0
8000490: 0fc9 lsrs r1, r1, #31
8000492: ea41 0040 orr.w r0, r1, r0, lsl #1
8000496: 4770 bx lr
but where it is called they have hardcoded the answer because they had all the information needed to do so:
8000520: f04f 5000 mov.w r0, #536870912 ; 0x20000000
8000524: 2101 movs r1, #1
8000526: f7ff fe73 bl 8000210 <hexstring>
If you don't want the hardcoded answer you need to use a function that is not in the same optimization pass.
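A different way to keep the multiply in the generated code, not the one used in this answer, is to read the operands from volatile objects so the compiler cannot assume their values. The sketch below is only an illustration. The names in_a, in_b and mul_from_volatile are hypothetical, and the right shift of a negative product would be implementation-defined (the values used here are positive).

```c
#include <stdint.h>

/* Hypothetical inputs: volatile forces a real load at run time, so the
 * compiler cannot fold the multiply into a constant. */
volatile int32_t in_a = 0x40000000;
volatile int32_t in_b = 0x40000000;

int32_t mul_from_volatile(void)
{
    int32_t a = in_a;   /* one load of each operand */
    int32_t b = in_b;
    return (int32_t)(((int64_t)a * b) >> 31);
}
```

Inspecting the disassembly of mul_from_volatile should then show the loads followed by an actual multiply rather than a precomputed constant, though the exact instructions depend on the compiler and options.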
Manipulating the compiler and optimizer comes down to a lot of practice and it is not an exact science as the compilers and optimizers are constantly evolving (for better or worse).
By isolating a small bit of code in a function you cause problems in another way. Larger functions are more likely to need a stack frame and to evict variables from registers to the stack as they go; smaller functions might not need to do that, and the optimizers may change how the code is implemented as a result. You test the code fragment one way to see what the compiler is doing, then use it in a larger function and don't get the result you want. If there is an exact instruction or sequence of instructions you want implemented, implement them in assembler. If you are targeting a specific set of instructions in a specific instruction set/processor, then avoid the game, avoid your code changing when you change computers/compilers/etc, and just use assembler for that target. If needed, use ifdef or other conditional compile options to build for different targets without the assembler.
GCC supports actual fixed-point types: http://gcc.gnu.org/onlinedocs/gcc/Fixed_002dPoint.html
I'm not sure what instruction it will use, but it might make your life easier.
