Is atomic.LoadUint32 necessary? - go

Go's atomic package provides function func LoadUint32(addr *uint32) (val uint32). I looked into the assembly implementation:
TEXT ·LoadUint32(SB),NOSPLIT,$0-12
MOVQ addr+0(FP), AX
MOVL 0(AX), AX
MOVL AX, val+8(FP)
RET
which basically load the value from the memory address and return it.
I'm wondering if we have a uint32 pointer(addr) x, what is the difference between calling atomic.LoadUint32(x) and directly access it using *x?

which basically load the value from the memory address and return it.
That is the case in your context, but might differ on a different machine architecture where atomicity is to be implemented, as discussed here.
As mentioned in go issue 8739
We intrinsify both sync/atomic and runtime/internal/atomic for a bunch of architectures.
The APIs are not unified (e.g. LoadUint32 in sync/atomic is Load in runtime/internal/atomic).
(* "intrinsify" as in issue 4947)
As mentioned in my first link:
Regarding loads and stores.
Memory model along with instruction set specifies whether plain loads and stores are atomic or not. Typical guarantee for all modern commodity hardware is that aligned word-sized loads and stores are atomic. For example, on x86 architecture (IA-32 and Intel 64) 1-, 2-, 4-, 8- and 16-byte aligned loads and stores are all atomic (that is, plain MOV instruction, MOVQ and MOVDQA are atomic).

Related

What does write_cr0(read_cr0() | 0x10000) do?

I searched the web a lot but didn't find a short explanation about what write_cr0(read_cr0() | 0x10000) really do. It is related to the Linux kernel and I curios about developing LKM's. I want to know what this really do and what are the security issues with this.
It used to remove the write protection on the syscall table.
But how it is really works? and what does each thing in this line?
CR0 is one of the control registers available on x86 CPUs, which contains flags controlling CPU features related to memory protection, multitasking, paging, etc. You can find a full description in Volume 3, Section 2.5 of Intel's Software Developer's Manual.
These registers are accessed by special instructions that the compiler doesn't normally generate, so read_cr0() is a function which executes the instruction to read this register (via inline assembly) and returns the result in a general-purpose register. Likewise, write_cr0() writes to this register.
The function calls are likely to be inlined, so that the generated code would be something like
mov eax, cr0
or eax, 0x10000
mov cr0, eax
The OR with 0x10000 sets bit 16, the Write Protect bit. On early 32-bit x86 CPUs, code running at supervisor level (like the kernel) was always allowed to write all of virtual memory, regardless of whether the page was marked read-only. This bit makes that optional, so that when it is set, such accesses will cause page faults. This line of code probably follows an earlier line which temporarily cleared the bit.

which MOV instructions in the x86 are not used or the least used, and can be used for a custom MOV extension

I am modelling a custom MOV instruction in the X86 architecture in the gem5 simulator, to test its implementation on the simulator, I need to compile my C code using inline assembly to create a binary file. But since it a custom instruction which has not been implemented in the GCC compiler, the compiler will throw out an error. I know one way is to extend the GCC compiler to accept my custom X86 instruction, but I do not want to do it as it is more time consuming(but will do it afterwards).
As a temporary hack (just to check if my implementation is worth it or not). I want to edit an already MOV instruction while changing its underlying "micro ops" in the simulator so as to trick the GCC to accept my "custom" instruction and compile.
As they are many types of MOV instructions which are available in the x86 architecture. As they are various MOV Instructions in the 86 architecture reference.
Therefore coming to my question, which MOV instruction is the least used and that I can edit its underlying micro-ops. Assuming my workload just includes integers i.e. most probably wont be using the xmm and mmx registers and my instructions mirrors the same implementation of a MOV instruction.
Your best bet is regular mov with a prefix that GCC will never emit on its own. i.e. create a new mov encoding that includes a mandatory prefix in front of any other mov. Like how lzcnt is rep bsr.
Or if you're modifying GCC and as, you can add a new mnemonic that just uses otherwise-invalid (in 64-bit mode) single byte opcodes for memory-source, memory-dest, and immediate-source versions of mov. AMD64 freed up several opcodes, including the BCD instructions like AAM, and push/pop most segment registers. (x86-64 can still mov to/from Sregs, but there's just 1 opcode per direction, not 2 per Sreg for push ds/pop ds etc.)
Assuming my workload just includes integers i.e. most probably wont be using the xmm and mmx registers
Bad assumption for XMM: GCC aggressively uses 16-byte movaps / movups instead of copying structs 4 or 8 bytes at a time. It's not at all rare to find vector mov instructions in scalar integer code as part of inline expansion of small known-length memcpy or struct / array init. Also, those mov instructions have at least 2-byte opcodes (SSE1 0F 28 movaps, so a prefix in front of plain mov is the same size as your idea would have been).
However, you're right about MMX regs. I don't think modern GCC will ever emit movq mm0, mm1 or use MMX at all, unless you use MMX intrinsics. Definitely not when targeting 64-bit code.
Also mov to/from control regs (0f 21/23 /r) or debug registers (0f 20/22 /r) are both the mov mnemonic, but gcc will definitely never emit either on its own. Only available with GP register operands as the operand that isn't the debug or control register. So that's technically the answer to your title question, but probably not what you actually want.
GCC doesn't parse its inline asm template string, it just includes it in its asm text output to feed to the assembler after substituting for %number operands. So GCC itself is not an obstacle to emitting arbitrary asm text using inline asm.
And you can use .byte to emit arbitrary machine code.
Perhaps a good option would be to use a 0E byte as a prefix for your special mov encoding that you're going to make GEM decode specially. 0E is push CS in 32-bit mode, invalid in 64-bit mode. GCC will never emit either.
Or just an F2 repne prefix; GCC will never emit repne in front of a mov opcode (where it doesn't apply), only movs. (F3 rep / repe means xrelease when used on a memory-destination instruction so don't use that. https://www.felixcloutier.com/x86/xacquire:xrelease says that F2 repne is the xacquire prefix when used with locked instructions, which doesn't include mov to memory so it will be silently ignored there.)
As usual, prefixes that don't apply have no documented behaviour, but in practice CPUs that don't understand a rep / repne ignore it. Some future CPU might understand it to mean something special, and that's exactly what you're doing with GEM.
Picking .byte 0x0e; instead of repne; might be a better choice if you want to guard against accidentally leaving these prefixes in a build you run on a real CPU. (It will #UD -> SIGILL in 64-bit mode, or usually crash from messing up the stack in 32-bit mode.) But if you do want to be able to run the exact same binary on a real CPU, with the same code alignment and everything, then an ignored REP prefix is ideal.
Using a prefix in front of a standard mov instruction has the advantage of letting the assembler encode the operands for you:
template<class T>
void fancymov(T& dst, T src) {
// fixme: imm -> mem needs a size suffix, defeating template
// unless you use Intel-syntax where the operand includes "dword ptr"
asm("repne; movl %1, %0"
#if 1
: "=m"(dst)
: "ri" (src)
#else
: "=g,r"(dst)
: "ri,rmi" (src)
#endif
: // no clobbers
);
}
void test(int *dst, long src) {
fancymov(*dst, (int)src);
fancymov(dst[1], 123);
}
(Multi-alternative constraints let the compiler pick either reg/mem destination or reg/mem source. In practice it prefers the register destination even when that will cost it another instruction to do its own store, so that sucks.)
On the Godbolt compiler explorer, for the version that only allows a memory-destination:
test(int*, long):
repne; movl %esi, (%rdi) # F2 E9 37
repne; movl $123, 4(%rdi) # F2 C7 47 04 7B 00 00 00
ret
If you wanted this to be usable for loads, I think you'd have to make 2 separate versions of the function and use the load version or store version manually, where appropriate, because GCC seems to want to use reg,reg whenever it can.
Or with the version allowing register outputs (or another version that returns the result as a T, see the Godbolt link):
test2(int*, long):
repne; mov %esi, %esi
repne; mov $123, %eax
movl %esi, (%rdi)
movl %eax, 4(%rdi)
ret

What is asm instruction "jmpq *0xa48201(%rip)" exactly doing? [duplicate]

How is the address 0x600860 computed in the Intel instruction below? 0x4003b8 + 0x2004a2 = 60085a, so I don't see how the computation is carried out.
0x4003b8 <puts#plt>: jmpq *0x2004a2(%rip) # 0x600860 <puts#got.plt>
On Intel, JMP, CALL, etc. are relative to the program counter of the next instruction.
The next instruction in your case was at 0x4003be, and 0x4003be + 0x2004a2 == 0x600860
It's AT&T syntax for a memory-indirect JMP with a RIP-relative addressing mode.
The jump address is fetched from the memory location that is specified relative to the instruction pointer:
first calculate 0x4003be + 0x2004a2 == 0x600860 then fetch the address to jump to from location 0x600860.
Other addressing modes are possible, for example a jump-table might use
jmpq *(%rdi, %rax, 8) with the table base in RDI and the index in RAX.
RIP-relative addressing for static data is common, though. In this case, it's addressing an entry in the GOT (Global Offset Table), set up by dynamic linking.

gcc, __atomic_exchange seems to produce non-atomic asm, why?

I am working on a nice tool, which requires the atomic swap of two different 64-bit values. On the amd64 architecture it is possible with the XCHGQ instruction (see here in doc, warning: it is a long pdf).
Correspondingly, gcc has some atomic builtins which would ideally do the same, as it is visible for example here.
Using these 2 docs I produced the following simple C function, for the atomic swapping of two, 64-bit values:
void theExchange(u64* a, u64* b) {
__atomic_exchange(a, b, b, __ATOMIC_SEQ_CST);
};
(Btw, it wasn't really clear to me, why needs an "atomic exchange" 3 operands.)
It was to me a little bit fishy, that the gcc __atomic_exchange macro uses 3 operands, so I tested its asm output. I compiled this with a gcc -O6 -masm=intel -S and I've got the following output:
.LHOTB0:
.p2align 4,,15
.globl theExchange
.type theExchange, #function
theExchange:
.LFB16:
.cfi_startproc
mov rax, QWORD PTR [rsi]
xchg rax, QWORD PTR [rdi] /* WTF? */
mov QWORD PTR [rsi], rax
ret
.cfi_endproc
.LFE16:
.size theExchange, .-theExchange
.section .text.unlikely
As we can see, the result function contains not only a single data move, but three different data movements. Thus, as I understood this asm code, this function won't be really atomic.
How is it possible? Maybe I misunderstood some of the docs? I admit, the gcc builtin doc wasn't really clear to me.
This is the generic version of __atomic_exchange_n (type *ptr, type val, int memorder) where only the exchange operation on ptr is atomic, the reading of val is not. In the generic version, val is accessed via pointer, but the atomicity still does not apply to it. The pointer is so that it will work with multiple sizes, when the compiler has to call an external helper:
The four non-arithmetic functions (load, store, exchange, and
compare_exchange) all have a generic version as well. This generic
version works on any data type. It uses the lock-free built-in
function if the specific data type size makes that possible;
otherwise, an external call is left to be resolved at run time. This
external call is the same format with the addition of a ‘size_t’
parameter inserted as the first parameter indicating the size of the
object being pointed to. All objects must be the same size.

what is GCC compiler option to get Segment Override prefix in x86

I have memory layout (In Increasing memory addr) like :
Code Section (0-4k), Data Section(4k-8k), Stack Section(8k-12k), CustomData Section(12k-16k).
I have put some special arrays, structs in Custom Data Section.
As i know, Data Segment (#DS)Selector will be used for any Data related compiler code.
So Data Section(4k-8k) will have #DS by default for all operation. Except some str op where ES may be used. Like:
mov $0xc00,%eax
addl $0xd, (%eax)
But, I want to use Extra Segment(#ES) selector for CustomData access. I would define a new GDT entry for ES with different Base and Limit. like:
mov $0x3400,%eax
addl $0xd, %es:(%eax)
So my question is:
Does GCC has any x86 compiler flag, which can be used to tell compiler that use #ES for CustomData Section code access.?
Means, compiler flag which will generate code using #ES for CustomData Section.?
Thanks in advance !!
While the question is asking for an option to make gcc's code-gen use an es prefix for accessing a custom section, IF you wanted to do it in hand-written code, the AT&T syntax already allows e.g. %es:(%eax).
Note, this may break rep-string instructions which gcc sometimes inlines; fs or gs would be the only sane choices, and are still usable even in x86-64.
(Making this comment community wiki based on useful critique by Peter Cordes.)
Quoting the example from the clang language-extension docs
#define GS_RELATIVE __attribute__((address_space(256)))
int foo(int GS_RELATIVE *P) {
return *P;
}
Which compiles to (on X86-32):
_foo:
movl 4(%esp), %eax # load the arg
movl %gs:(%eax), %eax # use it with its prefix
ret
address-space 256 is gs, 257 is fs, and 258 is ss.
The docs don't mention es; compilers normally assume that es=ds so they can freely inline rep movs or rep stos for memcpy / memset if they choose to do so based on tuning options. Library implementations of memcpy or memset may also use rep stos/movs on some CPUs. Related: Why is std::fill(0) slower than std::fill(1)?
Obviously this is really low-level stuff, and only makes sense if you've already set the GS or FS base address. (wrfsbase). Beware that i386 Linux uses gs for thread-local storage in user-space, while x86-64 Linux uses fs.
I'm not aware of extensions like this for gcc, ICC, or MSVC.
Well, there is __thread in GNU C, which will use %gs: or %fs: prefixes as appropriate for the target platform. How does the gcc `__thread` work?.

Resources