Activation records - C - macos

Please consider the below program:
#include <stdio.h>
void my_f(int);
int main()
{
int i = 15;
my_f(i);
}
void my_f(int i)
{
int j[2] = {99, 100};
printf("%d\n", j[-2]);
}
My understanding is that the activation record (aka stack frame) for my_f() should look like this:
------------
| i | 15
------------
| Saved PC | Address of next instruction in caller function
------------
| j[0] | 99
------------
| j[1] | 100
------------
I expected j[-2] to print 15, but it prints 0. Could someone please explain what I am missing here? I am using GCC 4.0.1 on OS X 10.5.8 (Yes, I live under a rock, but that's beside the point here).

If you ever actually want the address of your stack frame in GNU C, use
__builtin_frame_address(0) (non-zero args attempt to backtrace up the stack to parent stack frames). This is the address of the first thing pushed by the function, i.e. a saved ebp or rbp if you compiled with -fno-omit-frame-pointer. If you want to modify the return address on the stack, you might be able to do that with an offset from __builtin_frame_address(0), but to just read it reliably use __builtin_return_address(0).
GCC keeps the stack 16-byte-aligned in the usual x86 ABIs. There could easily be a gap between the return address and j[1]. In theory, it could put j[] as far down as it wanted, or optimize it away (or to a read-only static constant, since nothing writes it).
If you compiled with optimization, i probably isn't stored anywhere, and
my_f(int i) is inlined into main.
Also, like @EOF said, j[-2] is two spots below the bottom of your diagram. (Low addresses are at the bottom, because the stack grows down). Also note that the diagram on wikipedia (from the link I edited into the question) is drawn with low addresses at the top. The ASCII diagram in my answer has low addresses at the bottom.
If you compiled with -O0, then there's some hope. In 64bit code (the default target for 64bit builds of gcc and clang), the calling convention passes the first 6 args in registers, so the only i in memory will be in main's stack frame.
Also, in AMD64 code, j[3] might be the upper half of the return address (or the saved %rbp), if j[] is placed below one of those with no gap. (Pointers are 64-bit, int is still 32 bits.) j[2], the first out-of-bounds element, would alias onto the low 32 bits (aka low dword in Intel terminology, where a "word" is 16 bits.)
The best hope for this to work is in un-optimized 32bit code,
using a calling convention with no register-args. (e.g. the x86 32bit SysV ABI. See also the x86 tag wiki).
In that case, your stack will look like:
# 32bit stack-args calling convention, unoptimized code
higher addresses
^^^^^^^^^^^^
| argv |
------------
| argc |
-------------------
| main's ret addr |
-------------------
| ... |
| main()'s local variables and stuff, layout decided by the compiler
| ... |
------------
| i | # function arg
------------ <-- 16B-aligned boundary for the first arg, as required in the ABI
| ret addr |
------------ <--- esp pointer on entry to the function
|saved ebp | # because gcc -m32 -O0 uses -fno-omit-frame-pointer
------------ <--- ebp after mov ebp, esp (part of no-omit-frame-pointer)
unpredictable amount of padding, up to the compiler. (likely 0 bytes in this case)
but actually not: clang 3.5 for example makes a copy of its arg (`i`) here, and puts j[] right below that, so j[2] or j[5] will work
------------
| j[1] |
------------
| j[0] |
------------
| |
vvvvvvvvvvvv Lower addresses. (The wikipedia diagram is upside-down, IMO: it has low addresses at the top).
It's somewhat likely that the 8 byte j array will be placed right below the value written by push ebp, with no gap. That would make j[0] 16B-aligned, although there's no requirement or guarantee that local arrays have any particular alignment. (Except that C99 variable-length arrays are 16B-aligned, in the AMD64 SysV ABI. I don't remember there being a guarantee for non-variable length arrays, but I didn't check.)
If the function saved any other call-preserved registers (like ebx) so it could use them, those saved registers would be before or after the saved ebp, above space used for locals.
j[4] might work in 32bit code, like @EOF suggested. I assume he arrived at 4 by the same reasoning I did, but forgot to mention that it only applies to 32bit code.
Looking at the asm:
Of course, looking at what really happens is much better than all this guessing and hand-waving.
I put your function on the Godbolt compiler explorer, with the oldest gcc version it has (4.4.7), using -xc -O0 -Wall -fverbose-asm -m32. -xc is to compile as C, not C++.
my_f:
push ebp #
mov ebp, esp #,
sub esp, 40 #, # no idea why it reserves 40 bytes. clang 3.5 only reserves 24
mov DWORD PTR [ebp-16], 99 # j[0]
mov DWORD PTR [ebp-12], 100 # j[1]
mov edx, DWORD PTR [ebp+0] ###### This is the j[4] load
mov eax, OFFSET FLAT:.LC0 # put the format string address into eax
mov DWORD PTR [esp+4], edx # store j[4] on the stack, to become an arg for printf
mov DWORD PTR [esp], eax # store the format string
call printf #
leave
ret
So gcc puts j at ebp-16, not the ebp-8 that I guessed. j[4] gets the saved ebp. i is at j[6], 8 more bytes up the stack.
Remember, all we've learned here is what gcc 4.4 happens to do at -O0. There's no rule that says j[6] will refer to a location that holds a copy of i on any other setup, or with different surrounding code.
If you want to learn asm from compiler output, look at the asm from -Og or -O1 at least. -O0 stores everything to memory after every statement, so it's very noisy / bloated, which makes it harder to follow. Depending on what you want to learn, -O3 is good. Obviously you have to write functions that do something with input parameters instead of compile-time constants, so they don't optimize away. See How to remove "noise" from GCC/clang assembly output? (especially the link to Matt Godbolt's CppCon2017 talk), and other links in the x86 tag wiki.
clang 3.5.
As noted in the diagram, it copies i from the arg slot to a local. Although when it calls printf, it copies from the arg slot again, not the copy inside its own stack frame.
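The answer recommends __builtin_frame_address(0) and __builtin_return_address(0) instead of guessing offsets from j[]. As an added example (not code from the question or the answer), the following program records those values for a function shaped like my_f() and reports where j[] and i actually sit relative to the frame address; the names frame_info, probe, call_probe, ret_addr_in_caller and probe_checks are this example's own, and it assumes gcc or clang on x86-64 or AArch64.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* What a probe function records about its own activation record. */
struct frame_info {
    uintptr_t frame;   /* __builtin_frame_address(0): the saved-frame-pointer slot */
    uintptr_t ret;     /* __builtin_return_address(0): where the function returns to */
    uintptr_t param;   /* address of the callee's slot for its int argument */
    uintptr_t local;   /* address of the local array j[] */
};

/* Shaped like the question's my_f(): one int argument and a 2-element local
 * array, but it records addresses with the builtins instead of indexing past j[]. */
__attribute__((noinline)) struct frame_info probe(int i)
{
    volatile int j[2] = {99, 100};   /* volatile: must really live in memory */
    struct frame_info f;
    f.frame = (uintptr_t)__builtin_frame_address(0);
    f.ret   = (uintptr_t)__builtin_return_address(0);
    f.param = (uintptr_t)&i;          /* taking the address forces i into the frame */
    f.local = (uintptr_t)&j[0];
    (void)j[1];
    return f;
}

/* Caller whose body contains the call, so the recorded return address must fall
 * inside this function. The asm barrier after the call prevents a tail call. */
__attribute__((noinline)) struct frame_info call_probe(void)
{
    struct frame_info f = probe(15);
    __asm__ volatile("" ::: "memory");
    return f;
}

/* True if the return address recorded by probe() lies inside call_probe(). */
int ret_addr_in_caller(const struct frame_info *f)
{
    uintptr_t lo = (uintptr_t)call_probe;
    return f->ret > lo && f->ret < lo + 4096;
}

/* Self-check: return address inside the caller, frame address non-null and
 * at least 8-byte aligned (a pushed frame pointer is always aligned). */
int probe_checks(void)
{
    struct frame_info f = call_probe();
    return ret_addr_in_caller(&f) && f.frame != 0 && (f.frame % 8) == 0;
}

int main(void)
{
    struct frame_info f = call_probe();
    /* Signed distances: negative means "below" the frame address. */
    printf("j[] is %td bytes from the frame address\n", (ptrdiff_t)(f.local - f.frame));
    printf("i   is %td bytes from the frame address\n", (ptrdiff_t)(f.param - f.frame));
    printf("return address inside call_probe: %s\n", ret_addr_in_caller(&f) ? "yes" : "no");
    return probe_checks() ? 0 : 1;
}
```

Compare the printed offsets between -O0 and -O2, and between gcc and clang, to see why the answer says the placement of j[] relative to i is the compiler's choice rather than a rule.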

In theory you are right but practically it depends on a lot of issues. These are e.g. the calling conventions, operating system type and version, and also on the compiler type and version.
You can only tell that specifically by looking at the final disassembly of your code.

Related

Incorrect Relative call address for 32/16bit bootloader compiled using gcc/ld for x86

This question is similar to Incorrect relative address using GNU LD w/ 16-bit x86, but I could not solve it by building a cross-compiler.
The scenario is, I have a second stage bootloader that starts as 16bit, and brings itself up to 32 bit. As a result, I have mixed assembly and C code with 32 and 16 bit code mixed together.
I have included an assembly file which defines a global that I will call from C, basically with the purpose of dropping back to REAL mode to perform BIOS interrupts from the protected mode C environment on demand. So far, the function doesn't do anything except get called, push and pop some registers, and return:
[bits 32]
BIOS_Interrupt:
PUSHF
...
ret
global BIOS_Interrupt
This is included in my main bootloader.asm file that is loaded by the stage 1 MBR.
In C, I have defined:
extern void BIOS_Interrupt(uint32_t intno, uint32_t ax, uint32_t bx, uint32_t cx, uint32_t dx, uint32_t es, uint32_t ds, uint32_t di, uint32_t si);
in a header, and
BIOS_Interrupt(0x15,0,0,0,0,0,0,0,0);
in code, just to test calling.
I can see in the resultant disassembled linked binary that the call is invariably set 2 bytes too low in RAM:
00000132 0100 add [bx+si],ax
00000134 009CFA9D add [si-0x6206],bl
00000138 6650 push eax
0000013A 6653 push ebx
...
00001625 6A00 push byte +0x0
00001627 6A00 push byte +0x0
00001629 6A00 push byte +0x0
0000162B 6A00 push byte +0x0
0000162D 6A00 push byte +0x0
0000162F 6A00 push byte +0x0
00001631 6A00 push byte +0x0
00001633 6A00 push byte +0x0
00001635 6A15 push byte +0x15
00001637 E8F9EA call 0x133
The instruction at 135 should be the first instruction reached (0x9C = PUSHF), but the call is for 2 bytes less in memory at 133, causing runtime errors.
I have noticed that by using the NASM .align keyword, the extra NOPs that are generated do compensate for the incorrect relative address.
Is this an issue with the linking process? I have LD running with -melf_i386, NASM with -f elf and GCC with -m32 -ffreestanding -Os -fno-pie -nostdlib
edit: images added for @MichaelPetch. Code is loaded at 0x9000 by MBR. Interestingly, the call shows a correct relative jump, to 0x135, but the executing disassembly at 0x135 looks like the code at 0x133 (0x00, 0x00).
Bochs about to call BIOS_Interrupt
Bochs at call start
edit 2: correction to image 2 after refreshing memdump after call
memdump and disassembly after calling BIOS_Interrupt (call 0x135)
Thanks again to @MichaelPetch for giving me a few pointers.
I don't think there is an issue with the linker, and that the disassembly was "tricking" me, in that the combination of 16 and 32 bit code led to inaccurate code.
In the end, it was due to overriding of memory values from prior operations. In the code immediately before the BIOS_Interrupt label, I had defined a dword, dd IDT_REAL, designed to store the IDT for real mode processing. However, I did not realise (or forgot) that the SIDT/LIDT instructions take 6 bytes of data, so when I was calling SIDT, it was overriding the first 2 bytes of the label's location in RAM, resulting in runtime errors. After increasing the size of the variable from dword to qword, I can run just fine w/o error.
The linker/compiler suggestion seems to be a red-herring that I fell for courtesy of objdump. However, I've at least learned from this the benefits of Bochs and double checking code before jumping to conclusions!
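The root cause was a 4-byte reservation for a value SIDT writes as 6 bytes. As an added example (not the bootloader's own code), the following C program models that layout with a packed descriptor struct and counts how many bytes of the code placed right after the slot a too-small reservation loses; simulate_sidt and idtr32 are this example's own names, and the memory image is constructed.

```c
#include <stdint.h>
#include <string.h>

/* What SIDT stores outside 64-bit mode: a 16-bit limit followed by a
 * 32-bit linear base address, 6 bytes in total (10 bytes in 64-bit mode). */
struct idtr32 {
    uint16_t limit;
    uint32_t base;
} __attribute__((packed));

#define CODE_LEN 4

/* Simulate `sidt [slot]` into a reservation of slot_size bytes that sits
 * immediately before the first bytes of BIOS_Interrupt (0x9C = PUSHF, then
 * push eax / push ebx, as in the question's listing). Returns the number of
 * code bytes that SIDT overwrote, i.e. 0 when the reservation is big enough. */
int simulate_sidt(int slot_size, uint16_t limit, uint32_t base)
{
    static const uint8_t prologue[CODE_LEN] = {0x9C, 0x66, 0x50, 0x66};
    struct idtr32 v = { limit, base };
    uint8_t mem[16] = {0};                 /* constructed stand-in for RAM */
    uint8_t *code = mem + slot_size;       /* code starts right after the slot */

    memcpy(code, prologue, CODE_LEN);
    memcpy(mem, &v, sizeof v);             /* SIDT always writes sizeof v = 6 bytes */

    int clobbered = 0;
    for (int k = 0; k < CODE_LEN; k++)
        if (code[k] != prologue[k])
            clobbered++;
    return clobbered;
}
```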

which MOV instructions in the x86 are not used or the least used, and can be used for a custom MOV extension

I am modelling a custom MOV instruction in the X86 architecture in the gem5 simulator. To test its implementation on the simulator, I need to compile my C code using inline assembly to create a binary file. But since it is a custom instruction which has not been implemented in the GCC compiler, the compiler will throw out an error. I know one way is to extend the GCC compiler to accept my custom X86 instruction, but I do not want to do it as it is more time consuming (but will do it afterwards).
As a temporary hack (just to check if my implementation is worth it or not), I want to edit an existing MOV instruction while changing its underlying "micro ops" in the simulator so as to trick the GCC to accept my "custom" instruction and compile.
There are many types of MOV instructions which are available in the x86 architecture, as the various MOV instructions in the x86 architecture reference show.
Therefore coming to my question, which MOV instruction is the least used and that I can edit its underlying micro-ops? Assuming my workload just includes integers, i.e. most probably won't be using the xmm and mmx registers, and my instruction mirrors the same implementation of a MOV instruction.
Your best bet is regular mov with a prefix that GCC will never emit on its own. i.e. create a new mov encoding that includes a mandatory prefix in front of any other mov. Like how lzcnt is rep bsr.
Or if you're modifying GCC and as, you can add a new mnemonic that just uses otherwise-invalid (in 64-bit mode) single byte opcodes for memory-source, memory-dest, and immediate-source versions of mov. AMD64 freed up several opcodes, including the BCD instructions like AAM, and push/pop most segment registers. (x86-64 can still mov to/from Sregs, but there's just 1 opcode per direction, not 2 per Sreg for push ds/pop ds etc.)
Assuming my workload just includes integers i.e. most probably wont be using the xmm and mmx registers
Bad assumption for XMM: GCC aggressively uses 16-byte movaps / movups instead of copying structs 4 or 8 bytes at a time. It's not at all rare to find vector mov instructions in scalar integer code as part of inline expansion of small known-length memcpy or struct / array init. Also, those mov instructions have at least 2-byte opcodes (SSE1 0F 28 movaps, so a prefix in front of plain mov is the same size as your idea would have been).
However, you're right about MMX regs. I don't think modern GCC will ever emit movq mm0, mm1 or use MMX at all, unless you use MMX intrinsics. Definitely not when targeting 64-bit code.
Also mov to/from control regs (0f 21/23 /r) or debug registers (0f 20/22 /r) are both the mov mnemonic, but gcc will definitely never emit either on its own. Only available with GP register operands as the operand that isn't the debug or control register. So that's technically the answer to your title question, but probably not what you actually want.
GCC doesn't parse its inline asm template string, it just includes it in its asm text output to feed to the assembler after substituting for %number operands. So GCC itself is not an obstacle to emitting arbitrary asm text using inline asm.
And you can use .byte to emit arbitrary machine code.
Perhaps a good option would be to use a 0E byte as a prefix for your special mov encoding that you're going to make GEM decode specially. 0E is push CS in 32-bit mode, invalid in 64-bit mode. GCC will never emit either.
Or just an F2 repne prefix; GCC will never emit repne in front of a mov opcode (where it doesn't apply), only movs. (F3 rep / repe means xrelease when used on a memory-destination instruction so don't use that. https://www.felixcloutier.com/x86/xacquire:xrelease says that F2 repne is the xacquire prefix when used with locked instructions, which doesn't include mov to memory so it will be silently ignored there.)
As usual, prefixes that don't apply have no documented behaviour, but in practice CPUs that don't understand a rep / repne ignore it. Some future CPU might understand it to mean something special, and that's exactly what you're doing with GEM.
Picking .byte 0x0e; instead of repne; might be a better choice if you want to guard against accidentally leaving these prefixes in a build you run on a real CPU. (It will #UD -> SIGILL in 64-bit mode, or usually crash from messing up the stack in 32-bit mode.) But if you do want to be able to run the exact same binary on a real CPU, with the same code alignment and everything, then an ignored REP prefix is ideal.
Using a prefix in front of a standard mov instruction has the advantage of letting the assembler encode the operands for you:
template<class T>
void fancymov(T& dst, T src) {
// fixme: imm -> mem needs a size suffix, defeating template
// unless you use Intel-syntax where the operand includes "dword ptr"
asm("repne; movl %1, %0"
#if 1
: "=m"(dst)
: "ri" (src)
#else
: "=g,r"(dst)
: "ri,rmi" (src)
#endif
: // no clobbers
);
}
void test(int *dst, long src) {
fancymov(*dst, (int)src);
fancymov(dst[1], 123);
}
(Multi-alternative constraints let the compiler pick either reg/mem destination or reg/mem source. In practice it prefers the register destination even when that will cost it another instruction to do its own store, so that sucks.)
On the Godbolt compiler explorer, for the version that only allows a memory-destination:
test(int*, long):
repne; movl %esi, (%rdi) # F2 E9 37
repne; movl $123, 4(%rdi) # F2 C7 47 04 7B 00 00 00
ret
If you wanted this to be usable for loads, I think you'd have to make 2 separate versions of the function and use the load version or store version manually, where appropriate, because GCC seems to want to use reg,reg whenever it can.
Or with the version allowing register outputs (or another version that returns the result as a T, see the Godbolt link):
test2(int*, long):
repne; mov %esi, %esi
repne; mov $123, %eax
movl %esi, (%rdi)
movl %eax, 4(%rdi)
ret

gcc, __atomic_exchange seems to produce non-atomic asm, why?

I am working on a nice tool, which requires the atomic swap of two different 64-bit values. On the amd64 architecture it is possible with the XCHGQ instruction (see here in doc, warning: it is a long pdf).
Correspondingly, gcc has some atomic builtins which would ideally do the same, as it is visible for example here.
Using these 2 docs I produced the following simple C function, for the atomic swapping of two, 64-bit values:
typedef unsigned long long u64; /* 64-bit unsigned, as used in the question */
void theExchange(u64* a, u64* b) {
__atomic_exchange(a, b, b, __ATOMIC_SEQ_CST);
}
(Btw, it wasn't really clear to me why an "atomic exchange" needs 3 operands.)
It seemed a little bit fishy to me that the gcc __atomic_exchange macro uses 3 operands, so I tested its asm output. I compiled this with gcc -O6 -masm=intel -S and I've got the following output:
.LHOTB0:
.p2align 4,,15
.globl theExchange
.type theExchange, @function
theExchange:
.LFB16:
.cfi_startproc
mov rax, QWORD PTR [rsi]
xchg rax, QWORD PTR [rdi] /* WTF? */
mov QWORD PTR [rsi], rax
ret
.cfi_endproc
.LFE16:
.size theExchange, .-theExchange
.section .text.unlikely
As we can see, the resulting function contains not only a single data move, but three different data movements. Thus, as I understood this asm code, this function won't be really atomic.
How is it possible? Maybe I misunderstood some of the docs? I admit, the gcc builtin doc wasn't really clear to me.
This is the generic version of __atomic_exchange_n (type *ptr, type val, int memorder) where only the exchange operation on ptr is atomic, the reading of val is not. In the generic version, val is accessed via pointer, but the atomicity still does not apply to it. The pointer is so that it will work with multiple sizes, when the compiler has to call an external helper:
The four non-arithmetic functions (load, store, exchange, and
compare_exchange) all have a generic version as well. This generic
version works on any data type. It uses the lock-free built-in
function if the specific data type size makes that possible;
otherwise, an external call is left to be resolved at run time. This
external call is the same format with the addition of a ‘size_t’
parameter inserted as the first parameter indicating the size of the
object being pointed to. All objects must be the same size.
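To make the three-operand form concrete, here is an added example (not the question's or GCC's own code) contrasting the generic __atomic_exchange, the __atomic_exchange_n form, and a lock-protected swap of two memory locations, which is what the question actually needs since xchg only makes the access to one location atomic; exchange_into, exchange_n, swap_two and swap_lock are names chosen here.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t u64;

/* Generic form, as in the question: *ret = old *ptr; *ptr = *val.
 * Only the read-modify-write of *ptr is atomic. *val is read before the
 * xchg and *ret is written after it, as the question's asm output shows. */
void exchange_into(u64 *ptr, u64 *val, u64 *ret)
{
    __atomic_exchange(ptr, val, ret, __ATOMIC_SEQ_CST);
}

/* _n form: the new value is passed by value (a register) and the old value
 * is returned, so there is no second memory location that could be observed
 * half-updated by another thread. */
u64 exchange_n(u64 *ptr, u64 newval)
{
    return __atomic_exchange_n(ptr, newval, __ATOMIC_SEQ_CST);
}

/* Swapping two locations atomically with respect to each other is not what
 * xchg offers. Guard both accesses with a lock; constructed spinlock built
 * from the same builtin family. Every reader of a and b must take it too. */
static bool swap_lock;   /* hypothetical global lock, false = free */

void swap_two(u64 *a, u64 *b)
{
    while (__atomic_test_and_set(&swap_lock, __ATOMIC_ACQUIRE)) {
        /* spin until the lock is free */
    }
    u64 t = *a;
    *a = *b;
    *b = t;
    __atomic_clear(&swap_lock, __ATOMIC_RELEASE);
}
```

Passing the same pointer as val and ret, as the question does, produces a swap in a single-threaded run, but only *a is protected against a concurrent writer; a second thread can observe or overwrite *b between the load and the final store.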

How does gcc know the register size to use in inline assembly?

I have the inline assembly code:
#define read_msr(index, buf) asm volatile ("rdmsr" : "=d"(buf[1]), "=a"(buf[0]) : "c"(index))
The code using this macro:
u32 buf[2];
read_msr(0x173, buf);
I found the disassembly is (using gnu toolchain):
mov eax,0x173
mov ecx,eax
rdmsr
mov DWORD PTR [rbp-0xc],edx
mov DWORD PTR [rbp-0x10],eax
The question is that 0x173 is less than 0xffff, so why does gcc not use mov cx, 0x173? Will gcc analyse the following rdmsr instruction? Will gcc always know the correct register size?
It depends on the size of the value or variable passed.
If you pass a "short int" it will set "cx" and read the data from "ax" and "dx" (if buf is a short int, too).
For char it would access "cl" and so on.
So "c" refers to the "ecx" register, but this is accessed with "ecx", "cx", or "cl" depending on the size of the access, which I think makes sense.
To test you can try passing (unsigned short)0x173, it should change the code.
There is no analysis of the inline assembly (in fact it is, after text substitution, directly copied to the output assembly, including syntax errors). Also there is no default register size depending on whether you have a 32 or 64 bit target. This would be way too limiting.
I think the answer is because the current default data size is 32-bit. In 64-bit long mode, the default data size is also 32-bit, unless you use "rex.w" prefix.
Intel specifies the RDMSR instruction as using (all of) ECX to determine the model specific register. That being the case, and apparently as specified by your macro, GCC has every reason to load your constant into the full ECX.
So the question about why it doesn't load CX seems completely inappropriate. It looks like GCC is generating the right code.
(You didn't ask why it stages the load of ECX inefficiently by using EAX; I don't know the answer to that).

Can I choose RIP-relative or absolute addressing for different variables with gcc in x86-64

I write my own link script to put different variables in two different data sections (A & B).
A is linked to zero address;
B is linked near to code, and in high address space (higher than 4G, which is not available for normal absolute addressing in x86-64).
A can be accessed through absolute addressing, but not RIP-relative;
B can be accessed through RIP-relative addressing, but not absolute;
My question: Is there any way to choose RIP-relative or absolute addressing for different variables in gcc? Perhaps with some annotation like #pragma?
Without hacking the GCC source code, you're not going to get it to emit 32-bit absolute addressing, but there are cases where gcc will use 64-bit absolute addresses.
-mcmodel=medium puts large objects into a separate section, using 64-bit absolute addresses for the large-data section. (With a size threshold that all objects have to agree on, set by -mlarge-data-threshold=). But still uses RIP-relative for all other variables.
See the x86-64 System V ABI doc for more about the different memory models. And/or GCC docs for -mcmodel= and -mlarge-data-threshold= : https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
The default is -mcmodel=small : everything is within 2GiB of everything else, so RIP-relative works. And for non-PIE executables, that's the low 2GiB of virtual address space so static addresses can be 32-bit absolute sign- or zero-extended immediates or disp32 in addressing modes.
int a[1000000];
int b[1];
int fa() { return a[0]; }
int fb() { return b[0]; }
ASM output (Godbolt):
# gcc9.2 -O3 -mcmodel=medium
fa():
movabs eax, DWORD PTR [a] # 64-bit absolute address, special encoding for EAX
ret
fb():
mov eax, DWORD PTR b[rip]
ret
For loading into a register other than AL/AX/EAX/RAX, GCC would use movabs r64, imm64 with the address and then use mov reg, [reg].
You won't get gcc to use 32-bit absolute addressing for section A. It will always be using 64-bit absolute, never [array + rdx*4] or [abs foo] (NASM syntax). And never mov edi, msg (imm32) for putting an address in a register, always mov rdi, qword msg (imm64).
GCC puts b in the .lbss section and a in the regular .bss. Presumably you can use __attribute__((section("name"))) on
.globl b
.section .lbss,"aw" # "aw" = allocate(?), writeable
.align 32
.size b, 4000000
b:
.zero 4000000
.globl a
.bss # shortcut for .section
.align 4
a:
.zero 4
Things that don't work:
__attribute__((optimize("mcmodel=large"))) on a per-function basis. Doesn't actually work, and is per-function not per-variable anyway.
https://gcc.gnu.org/onlinedocs/gcc/Variable-Attributes.html doesn't document any x86 or common variable attributes related to memory-model or size. The only x86-specific variable attribute is ms vs gcc struct layout.
There are x86-specific attributes for functions and types, but those don't help.
Possible hacks:
Put all your section-A variables in a large struct, larger than any section-B global/static objects. Possibly pad it at the end with a dummy array to make it larger: your linker script can probably avoid actually allocating extra space for that dummy array.
Then compile with -mcmodel=medium -mlarge-data-threshold=that size.