What is asm instruction "jmpq *0xa48201(%rip)" exactly doing? [duplicate]

What is asm instruction "jmpq *0xa48201(%rip)" exactly doing? [duplicate] - gcc

How is the address 0x600860 computed in the Intel instruction below? 0x4003b8 + 0x2004a2 = 60085a, so I don't see how the computation is carried out.
0x4003b8 <puts#plt>: jmpq *0x2004a2(%rip) # 0x600860 <puts#got.plt>

On Intel, JMP, CALL, etc. are relative to the program counter of the next instruction.
The next instruction in your case was at 0x4003be, and 0x4003be + 0x2004a2 == 0x600860

It's AT&T syntax for a memory-indirect JMP with a RIP-relative addressing mode.
The jump address is fetched from the memory location that is specified relative to the instruction pointer:
first calculate 0x4003be + 0x2004a2 == 0x600860 then fetch the address to jump to from location 0x600860.
Other addressing modes are possible, for example a jump-table might use
jmpq *(%rdi, %rax, 8) with the table base in RDI and the index in RAX.
RIP-relative addressing for static data is common, though. In this case, it's addressing an entry in the GOT (Global Offset Table), set up by dynamic linking.

Related

Ask for clarification about "the segment registers continue to point to the same linear addresses as in real address mode" [duplicate]

This question already has an answer here:
How can the x86 processor fetch the instruction just after GDT is loaded by a bootloader?
(1 answer)
Closed 1 year ago.
The question is about persistent validity of code segment selector while switching from real mode to protected mode on intel i386. The switching code is as follows (excerpted from bootasm.S of xv6 x86 version):
9138 # Switch from real to protected mode. Use a bootstrap GDT that makes
9139 # virtual addresses map directly to physical addresses so that the
9140 # effective memory map doesn’t change during the transition.
9141 lgdt gdtdesc
9142 movl %cr0, %eax
9143 orl $CR0_PE, %eax
9144 movl %eax, %cr0
9150 # Complete the transition to 32−bit protected mode by using a long jmp
9151 # to reload %cs and %eip. The segment descriptors are set up with no
9152 # translation, so that the mapping is still the identity mapping.
9153 ljmp $(SEG_KCODE<<3), $start32
The GDT layout is as follows:
9182 gdt:
9183 SEG_NULLASM # null seg
9184 SEG_ASM(STA_X|STA_R, 0x0, 0xffffffff) # code seg
9185 SEG_ASM(STA_W, 0x0, 0xffffffff) # data seg
After executing line 9144, the processor switches to protected mode in which mere segment memory management is enabled (but paging has not yet been enabled). My understanding is that, since segment MM has been enabled, the fetching of the following instruction should conform to the rules of segment MM. At this point (immediately before line 9153), however, the code selector remains 0, which in my understanding means the code segment should have selected the zero-th descriptor in GDT, which is null. But my question comes out naturally, how such a null descriptor can load the supposed ljmp instruction? I tried to answer my question by googling, and a document gives some explanation as follows: http://www.logix.cz/michal/doc/i386/chp10-03.htm#10-03
The segment registers continue to point to the same linear addresses
as in real address mode
This sentence seems to answer my question: if the segment registers continue to point to the same linear addresses, the next instruction should be the same as in real mode, that is, ljmp. But I immediately have a sequence of new questions: why can the segment selector "continue to point to the same linear addresses"? Hasn't the processor been changed to protected mode? Doesn't the value of 0 in %cs point to the zero-th descriptor, instead of the 1st (set in line 9184) which is the supposed descriptor to fetch ljmp instruction? How does the x86 CPU magically know it is the ljmp that is the next instruction it should execute? Where is the description in any manual that describe this magic? I tried to persuade myself that the ljmp has been prefetched in the processor's instruction queue, but the second paragraph of the same webpage tells me that the prefetched ljmp, if any, has been invalidated so the CPU should fetch the next instruction afresh. Can you please give me some clarification of how "the segment registers continue to point to the same linear addresses as in real address mode" magically? Thank you.
PS, the CPU I am working on is intel i386 compatible.

The modern reference is the Intel Software Developer's Manual, Volume 3A, Section 9.9.1, "Switching to protected mode".
Intel isn't big on explaining how magic works internally. What it says, and all you need to know, is that if your movl %eax, %cr0 is immediately followed by a far jump or far call, then everything will work. If you put any other instruction there, then "random failures can occur" (their wording).
As it says, %cs continues to hold its previous value, and presumably that's the value that would be pushed on the stack if you did a far call as the instruction after movl %eax, %cr0. (Where the stack would be is another interesting question - I think everyone uses the jump instead so it rarely comes up.) But for this one instruction it evidently isn't used as a selector in the usual way.
One guess as to how it might work: we know that in protected mode, there are hidden registers that store the segment attributes, and are reloaded from the descriptor table when you load a segment register. So the movl %eax, %cr0 might cause the hidden register corresponding to %cs to be loaded with attributes of a segment whose base address is the linear address of the current 16-bit segment: e.g. if %cs contained 0x1234 then it could be a segment with base address 0x12340. But the %cs register itself could be left alone, temporarily not matching its hidden counterpart. Then if the high bits of %eip are zeroed, the next instruction would be fetched from the right place. That instruction is required to be the long jump which will reload %cs as well as the hidden segment attribute register.
It's also possible that it just sets some internal flag that says "even though in protected mode, fetch the next instruction according to real-mode address translation". Then this flag gets cleared when a far jump occurs, or after one instruction has been fetched, or something like that.

which MOV instructions in the x86 are not used or the least used, and can be used for a custom MOV extension

I am modelling a custom MOV instruction in the X86 architecture in the gem5 simulator, to test its implementation on the simulator, I need to compile my C code using inline assembly to create a binary file. But since it a custom instruction which has not been implemented in the GCC compiler, the compiler will throw out an error. I know one way is to extend the GCC compiler to accept my custom X86 instruction, but I do not want to do it as it is more time consuming(but will do it afterwards).
As a temporary hack (just to check if my implementation is worth it or not). I want to edit an already MOV instruction while changing its underlying "micro ops" in the simulator so as to trick the GCC to accept my "custom" instruction and compile.
As they are many types of MOV instructions which are available in the x86 architecture. As they are various MOV Instructions in the 86 architecture reference.
Therefore coming to my question, which MOV instruction is the least used and that I can edit its underlying micro-ops. Assuming my workload just includes integers i.e. most probably wont be using the xmm and mmx registers and my instructions mirrors the same implementation of a MOV instruction.

Your best bet is regular mov with a prefix that GCC will never emit on its own. i.e. create a new mov encoding that includes a mandatory prefix in front of any other mov. Like how lzcnt is rep bsr.
Or if you're modifying GCC and as, you can add a new mnemonic that just uses otherwise-invalid (in 64-bit mode) single byte opcodes for memory-source, memory-dest, and immediate-source versions of mov. AMD64 freed up several opcodes, including the BCD instructions like AAM, and push/pop most segment registers. (x86-64 can still mov to/from Sregs, but there's just 1 opcode per direction, not 2 per Sreg for push ds/pop ds etc.)
Assuming my workload just includes integers i.e. most probably wont be using the xmm and mmx registers
Bad assumption for XMM: GCC aggressively uses 16-byte movaps / movups instead of copying structs 4 or 8 bytes at a time. It's not at all rare to find vector mov instructions in scalar integer code as part of inline expansion of small known-length memcpy or struct / array init. Also, those mov instructions have at least 2-byte opcodes (SSE1 0F 28 movaps, so a prefix in front of plain mov is the same size as your idea would have been).
However, you're right about MMX regs. I don't think modern GCC will ever emit movq mm0, mm1 or use MMX at all, unless you use MMX intrinsics. Definitely not when targeting 64-bit code.
Also mov to/from control regs (0f 21/23 /r) or debug registers (0f 20/22 /r) are both the mov mnemonic, but gcc will definitely never emit either on its own. Only available with GP register operands as the operand that isn't the debug or control register. So that's technically the answer to your title question, but probably not what you actually want.
GCC doesn't parse its inline asm template string, it just includes it in its asm text output to feed to the assembler after substituting for %number operands. So GCC itself is not an obstacle to emitting arbitrary asm text using inline asm.
And you can use .byte to emit arbitrary machine code.
Perhaps a good option would be to use a 0E byte as a prefix for your special mov encoding that you're going to make GEM decode specially. 0E is push CS in 32-bit mode, invalid in 64-bit mode. GCC will never emit either.
Or just an F2 repne prefix; GCC will never emit repne in front of a mov opcode (where it doesn't apply), only movs. (F3 rep / repe means xrelease when used on a memory-destination instruction so don't use that. https://www.felixcloutier.com/x86/xacquire:xrelease says that F2 repne is the xacquire prefix when used with locked instructions, which doesn't include mov to memory so it will be silently ignored there.)
As usual, prefixes that don't apply have no documented behaviour, but in practice CPUs that don't understand a rep / repne ignore it. Some future CPU might understand it to mean something special, and that's exactly what you're doing with GEM.
Picking .byte 0x0e; instead of repne; might be a better choice if you want to guard against accidentally leaving these prefixes in a build you run on a real CPU. (It will #UD -> SIGILL in 64-bit mode, or usually crash from messing up the stack in 32-bit mode.) But if you do want to be able to run the exact same binary on a real CPU, with the same code alignment and everything, then an ignored REP prefix is ideal.
Using a prefix in front of a standard mov instruction has the advantage of letting the assembler encode the operands for you:
template<class T>
void fancymov(T& dst, T src) {
// fixme: imm -> mem needs a size suffix, defeating template
// unless you use Intel-syntax where the operand includes "dword ptr"
asm("repne; movl %1, %0"
#if 1
: "=m"(dst)
: "ri" (src)
#else
: "=g,r"(dst)
: "ri,rmi" (src)
#endif
: // no clobbers
);
}
void test(int *dst, long src) {
fancymov(*dst, (int)src);
fancymov(dst[1], 123);
}
(Multi-alternative constraints let the compiler pick either reg/mem destination or reg/mem source. In practice it prefers the register destination even when that will cost it another instruction to do its own store, so that sucks.)
On the Godbolt compiler explorer, for the version that only allows a memory-destination:
test(int*, long):
repne; movl %esi, (%rdi) # F2 E9 37
repne; movl $123, 4(%rdi) # F2 C7 47 04 7B 00 00 00
ret
If you wanted this to be usable for loads, I think you'd have to make 2 separate versions of the function and use the load version or store version manually, where appropriate, because GCC seems to want to use reg,reg whenever it can.
Or with the version allowing register outputs (or another version that returns the result as a T, see the Godbolt link):
test2(int*, long):
repne; mov %esi, %esi
repne; mov $123, %eax
movl %esi, (%rdi)
movl %eax, 4(%rdi)
ret

Is atomic.LoadUint32 necessary?

Go's atomic package provides function func LoadUint32(addr *uint32) (val uint32). I looked into the assembly implementation:
TEXT ·LoadUint32(SB),NOSPLIT,$0-12
MOVQ addr+0(FP), AX
MOVL 0(AX), AX
MOVL AX, val+8(FP)
RET
which basically load the value from the memory address and return it.
I'm wondering if we have a uint32 pointer(addr) x, what is the difference between calling atomic.LoadUint32(x) and directly access it using *x?

which basically load the value from the memory address and return it.
That is the case in your context, but might differ on a different machine architecture where atomicity is to be implemented, as discussed here.
As mentioned in go issue 8739
We intrinsify both sync/atomic and runtime/internal/atomic for a bunch of architectures.
The APIs are not unified (e.g. LoadUint32 in sync/atomic is Load in runtime/internal/atomic).
(* "intrinsify" as in issue 4947)
As mentioned in my first link:
Regarding loads and stores.
Memory model along with instruction set specifies whether plain loads and stores are atomic or not. Typical guarantee for all modern commodity hardware is that aligned word-sized loads and stores are atomic. For example, on x86 architecture (IA-32 and Intel 64) 1-, 2-, 4-, 8- and 16-byte aligned loads and stores are all atomic (that is, plain MOV instruction, MOVQ and MOVDQA are atomic).

Assemble IA-32 mov [bootdrv], dl

I just start to program IA-32 assemble and boot loader and I can't understand one command: mov [bootdrv], dl.
dl is the low 8 bits of data register, but I dont know what is [bootdrv]. Is it a variable or something? How could a register be placed in [bootdrv]?
start:
mov ax,0x7c0 ; BIOS puts us at 0:07C00h, so set DS accordinly
mov ds,ax ; Therefore, we don't have to add 07C00h to all our data
mov [bootdrv], dl ; quickly save what drive we booted from
This is the beginning 3 line of a boot loader and [bootdrv] just appear without any definition, I couldn't understand.
Any information would be helpful and appreciated, thank you!

[bootdrv] is a specification of an absolute memory address. The code:
mov [bootdrv], dl
copies the contents of the 8-bit DL register into a byte in memory, at the address resulting of multiplying the current value of DS by 16, then add the value bootdrv. bootdrv itself is a label, which a value that represents where in the current data segment is the memory position located.
On the other hand, the symbol bootdrv must be defined somewhere. Otherwise, the assembler will stop with a "symbol not defined" error. Maybe it's defined past the code (assemblers do two passes through the source code in order to get all symbols so they can be used even if they are defined after the code sequence that uses them). Maybe it's in a separate .INC file.

mov [bootdrv], dl indicates a segment:offset memory access. In the previous instruction, you configured the Data Segment register with an address, so the mov [bootdrv], dl instruction writes to the segment:offset address 0x7c0:bootdrv, whatever bootdrv might be.

Can I choose RIP-relative or absolute addressing for different variables with gcc in x86-64

I write my own link script to put different variables in two different data sections (A & B).
A is linked to zero address;
B is linked near to code, and in high address space (higher than 4G, which is not available for normal absolute addressing in x86-64).
A can be accessed through absolute addressing, but not RIP-relative;
B can be accessed through RIP-relative addressing, but not absolute;
My question: Is there any way to choose RIP-relative or absolute addressing for different variables in gcc? Perhaps with some annotation like #pragma?

Without hacking the GCC source code, you're not going to get it to emit 32-bit absolute addressing, but there are cases where gcc will use 64-bit absolute addresses.
-mcmodel=medium puts large objects into a separate section, using 64-bit absolute addresses for the large-data section. (With a size threshold that all objects have to agree on, set by -mlarge-data-threshold=). But still uses RIP-relative for all other variables.
See the x86-64 System V ABI doc for more about the different memory models. And/or GCC docs for -mcmodel= and -mlarge-data-threshold= : https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
The default is -mcmodel=small : everything is within 2GiB of everything else, so RIP-relative works. And for non-PIE executables, that's the low 2GiB of virtual address space so static addresses can be 32-bit absolute sign- or zero-extended immediates or disp32 in addressing modes.
int a[1000000];
int b[1];
int fa() { return a[0]; }
int fb() { return b[0]; }
ASM output (Godbolt):
# gcc9.2 -O3 -mcmodel=medium
fa():
movabs eax, DWORD PTR [a] # 64-bit absolute address, special encoding for EAX
ret
fb():
mov eax, DWORD PTR b[rip]
ret
For loading into a register other than AL/AX/EAX/RAX, GCC would use movabs r64, imm64 with the address and then use mov reg, [reg].
You won't get gcc to use 32-bit absolute addressing for section A. It will always be using 64-bit absolute, never [array + rdx*4] or [abs foo] (NASM syntax). And never mov edi, msg (imm32) for putting an address in a register, always mov rdi, qword msg (imm64).
GCC puts b in the .lbss section and a in the regular .bss. Presumably you can use __attribute__((section("name"))) on
.globl b
.section .lbss,"aw" # "aw" = allocate(?), writeable
.align 32
.size b, 4000000
b:
.zero 4000000
.globl a
.bss # shortcut for .section
.align 4
a:
.zero 4
Things that don't work:
__attribute__((optimize("mcmodel=large"))) on a per-function basis. Doesn't actually work, and is per-function not per-variable anyway.
https://gcc.gnu.org/onlinedocs/gcc/Variable-Attributes.html doesn't document any x86 or common variable attributes related to memory-model or size. The only x86-specific variable attribute is ms vs gcc struct layout.
There are x86-specific attributes for functions and types, but those don't help.
Possible hacks:
Put all your section-A variables in a large struct, larger than any section-B global/static objects. Possibly pad it at the end with a dummy array to make it larger: your linker script can probably avoid actually allocating extra space for that dummy array.
Then compile with -mcmodel=medium mlarge-data-threshold=that size.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio