Why does GCC avoid vector registers for multi-element unions?

I have noticed that GCC generates very different (and less efficient) code when it is given a union of a SIMD vector type and any other type of the same size and alignment that is not itself a vector type.
In particular, as can be seen in this Godbolt example, when an __m128 vector type is placed in a union with a non-vector type, the union is passed in two XMM registers per argument and spilled to the stack before addps can use it, instead of being passed in a single XMM register and fed to addps directly. In the other two cases, a union containing only an __m128 and the bare __m128 vector itself, the arguments and the return value are passed in XMM registers directly and no stack is used.
What causes this discrepancy? Is there a way to "force" GCC to pass the multi-element union in XMM registers?
With union:
#include <immintrin.h>
#include <array>
union simd
{
__m128 vec;
alignas(__m128) std::array<float, 4> values;
};
simd add(simd a, simd b) noexcept
{
simd ret;
ret.vec = _mm_add_ps(a.vec, b.vec);
return ret;
}
add(simd, simd):
movq QWORD PTR [rsp-40], xmm0
movq QWORD PTR [rsp-32], xmm1
movq QWORD PTR [rsp-24], xmm2
movq QWORD PTR [rsp-16], xmm3
movaps xmm4, XMMWORD PTR [rsp-24]
addps xmm4, XMMWORD PTR [rsp-40]
movaps XMMWORD PTR [rsp-40], xmm4
movq xmm1, QWORD PTR [rsp-32]
movq xmm0, QWORD PTR [rsp-40]
ret
Without union:
__m128 add(__m128 a, __m128 b) noexcept
{
return _mm_add_ps(a, b);
}
add(float __vector(4), float __vector(4)):
addps xmm0, xmm1
ret
Note that the second case also applies when the __m128 vector is wrapped in an enclosing struct or union.

As suspected by Homer512, the answer lies in the AMD64 calling convention.
As per the System V AMD64 ABI section 3.2.3, every 8 bytes receives its own argument class (arguments smaller than 8 bytes are grouped together or padded).
For an argument to be passed in a single vector register, its classification must consist of a single SSE class followed by any number of SSEUP classes. The SSE class denotes the low-order 64 bits of a vector register, while SSEUP denotes each successive higher-order 64-bit chunk of the same register.
__m128 and other vector types, for instance, are treated as multi-eightbyte arguments classified as SSE followed by SSEUP, so they are passed in a single register. A scalar float, by contrast, is assigned the SSE class on its own and is passed in the low part of a register.
Argument classes for aggregate types (arrays, structs and classes) and unions, however, are determined based on their composition.
As such, given a union:
union simd
{
__m128 vec;
float vals[4];
};
The __m128 vec member falls under the special-case rule and is classified as SSE+SSEUP, so on its own it could be passed in a single register. So far so good. The float vals[4] array, however, consists of 2 independent 8-byte chunks, and each of them is assigned the SSE class, so the array is classified as SSE+SSE. That does not fit the SSE+SSEUP requirement, so the array has to be passed in the lower halves of 2 separate XMM registers, and since the union takes the lowest common denominator of its members, the union itself is treated as 2 arguments and passed in 2 registers.
In short, the calling convention treats the array as 2 separate 8-byte arguments and therefore has to pass it in 2 separate registers, while the standalone __m128 is treated as a single argument and is passed in a single register.
This, curiously, makes it so that the following union
union simd
{
__m128 vec;
float vals[2];
};
is, in fact, classified as SSE+SSEUP and is thus passed in a single register. The __m128 vec member is classified as SSE+SSEUP, while float vals[2] contributes only a single SSE class.
Unfortunately, it seems there is no way to explicitly specify (or hint) argument classes to the compiler.
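A workaround that seems to hold up in practice (a sketch, not a guaranteed fix; the function name add_fast is mine): keep __m128 as the parameter and return type at the call boundary and build the union only inside the function, so the union's SSE+SSE classification never reaches the ABI.
#include <immintrin.h>

union simd
{
    __m128 vec;
    float vals[4];
};

// Hypothetical wrapper: the ABI only ever sees __m128 (SSE+SSEUP), so the
// arguments and the return value stay in single XMM registers; the union
// exists purely inside the function body.
inline __m128 add_fast(__m128 a, __m128 b) noexcept
{
    simd ret;
    ret.vec = _mm_add_ps(a, b);
    return ret.vec;
}
With this shape GCC is free to compile the body down to a single addps, since nothing forces the value through memory.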

Related

How does GCC compile the 80 bit wide 10 byte float __float80 on x86_64?

According to one of the slides in What's A Creel's video "Modern x64 Assembly 4: Data Types" (link to the slide),
Note: real10 is only used with the x87 FPU, it is largely ignored nowadays but offers amazing precision!
He says,
"Real10 is only used with the x87 Floating Point Unit. [...] It's interesting the massive gain in precision that it offers you. You kind of take a performance hit with that gain because you can't use real10 with SSE, packed, SIMD style instructions. But, it's kind of interesting because if you want extra precision you can go to the x87 style FPU. Now a days it's almost never used at all."
However, I was googling and saw that GCC supports __float80 and __float128.
Is the __float80 in GCC calculated on the x87? Or is it using SIMD like the other float operations? What about __float128?
GCC docs for Additional Floating Types:
ISO/IEC TS 18661-3:2015 defines C support for additional floating types _Floatn and _Floatnx
... GCC does not currently support _Float128x on any systems.
I think _Float128 is IEEE binary128, i.e. a true 128-bit float with a huge exponent range, while _Float128x would be a wider extended format. See http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1691.pdf.
__float80 is obviously the x87 10-byte type. In the x86-64 SysV ABI, it's the same as long double; both have 16-byte alignment in that ABI.
__float80 is available on the i386, x86_64, and IA-64 targets, and supports the 80-bit (XFmode) floating type. It is an alias for the type name _Float64x on these targets.
On x86-64, __float128 is not an SSE2 hardware type and not a "double double" format: it is IEEE binary128 implemented in software, with a 112-bit mantissa and a 15-bit exponent, i.e. far more precision than __float80 and the same exponent range (not less).
On i386, x86_64, and ..., __float128 is an alias for _Float128
These related questions are about double-double, a different extended-precision technique that pairs two hardware doubles:
float128 and double-double arithmetic
Optimize for fast multiplication but slow addition: FMA and doubledouble
double-double implementation resilient to FPU rounding mode
As the compiler output below shows, gcc implements __float128 arithmetic with library calls (__addtf3) rather than inline SSE math.
Godbolt compiler explorer for gcc7.3 -O3 (same as gcc4.6, apparently these types aren't new)
//long double add_ld(long double x) { return x+x; } // same as __float80
__float80 add80(__float80 x) { return x+x; }
fld TBYTE PTR [rsp+8] # arg on the stack
fadd st, st(0)
ret # and returned in st(0)
__float128 add128(__float128 x) { return x+x; }
# IDK why not movapd or better movaps, silly compiler
movdqa xmm1, xmm0 # x arg in xmm0
sub rsp, 8 # align the stack
call __addtf3 # args in xmm0, xmm1
add rsp, 8
ret # return value in xmm0, I assume
int size80 = sizeof(__float80); // 16
int sizeld = sizeof(long double); // 16
int size128 = sizeof(__float128); // 16
So gcc calls a libgcc function for __float128 addition, not inlining an increment to the exponent or anything clever like that.
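For reference, here is a minimal sketch of using __float128 from C, assuming libquadmath is installed (link with -lquadmath); quadmath_snprintf and the Q literal suffix are GCC/libquadmath features:
#include <quadmath.h>
#include <stdio.h>

int main(void)
{
    __float128 x = 1.0Q / 3.0Q;   // arithmetic goes through the soft-fp routines
    char buf[128];
    quadmath_snprintf(buf, sizeof buf, "%.36Qg", x);
    printf("%s\n", buf);          // binary128 gives roughly 34 significant decimal digits
    return 0;
}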
I found the answer here
__float80 is available on the i386, x86_64, and IA-64 targets, and supports the 80-bit (XFmode) floating type. It is an alias for the type name _Float64x on these targets.
Having looked up the XFmode,
“Extended Floating” mode represents an IEEE extended floating point number. This mode only has 80 meaningful bits (ten bytes). Some processors require such numbers to be padded to twelve bytes, others to sixteen; this mode is used for either.
Still not totally convinced, I compiled something simple
int main () {
__float80 a = 1.445839898;
return 1;
}
Using Radare I dumped it,
0x00000652 db2dc8000000 fld xword [0x00000720]
0x00000658 db7df0 fstp xword [local_10h]
I believe fld and fstp are part of the x87 instruction set, so the x87 FPU is indeed being used for the 10-byte __float80. For __float128, however, I'm getting
0x000005fe 660f6f05aa00. movdqa xmm0, xmmword [0x000006b0]
0x00000606 0f2945f0 movaps xmmword [local_10h], xmm0
So here we can see xmm registers and an xmmword store being used, but only to copy the 16-byte value around; as the other answer shows, the actual __float128 arithmetic is still done in software (e.g. the __addtf3 call).

Aligned and unaligned memory access with AVX/AVX2 intrinsics

According to Intel's Software Developer Manual (sec. 14.9), AVX relaxed the alignment requirements of memory accesses. If data is loaded directly in a processing instruction, e.g.
vaddps ymm0,ymm0,YMMWORD PTR [rax]
the load address doesn't have to be aligned. However, if a dedicated aligned load instruction is used, such as
vmovaps ymm0,YMMWORD PTR [rax]
the load address has to be aligned (to multiples of 32), otherwise an exception is raised.
What confuses me is the automatic code generation from intrinsics, in my case by gcc/g++ (4.6.3, Linux). Please have a look at the following test code:
#include <x86intrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#define SIZE (1L << 26)
#define OFFSET 1
int main() {
float *data;
assert(!posix_memalign((void**)&data, 32, SIZE*sizeof(float)));
for (unsigned i = 0; i < SIZE; i++) data[i] = drand48();
float res[8] __attribute__ ((aligned(32)));
__m256 sum = _mm256_setzero_ps(), elem;
for (float *d = data + OFFSET; d < data + SIZE - 8; d += 8) {
elem = _mm256_load_ps(d);
// sum = _mm256_add_ps(elem, elem);
sum = _mm256_add_ps(sum, elem);
}
_mm256_store_ps(res, sum);
for (int i = 0; i < 8; i++) printf("%g ", res[i]); printf("\n");
return 0;
}
(Yes, I know the code is faulty, since I use an aligned load on unaligned addresses, but bear with me...)
I compile the code with
g++ -Wall -O3 -march=native -o memtest memtest.C
on a CPU with AVX. If I check the code generated by g++ by using
objdump -S -M intel-mnemonic memtest | more
I see that the compiler does not generate an aligned load instruction, but loads the data directly in the vector addition instruction:
vaddps ymm0,ymm0,YMMWORD PTR [rax]
The code executes without any problem, even though the memory addresses are not aligned (OFFSET is 1). This is clear since vaddps tolerates unaligned addresses.
If I uncomment the line with the second addition intrinsic, the loaded value is needed by more than one instruction, so the compiler can no longer fold the load into vaddps and generates:
vmovaps ymm0,YMMWORD PTR [rax]
vaddps ymm1,ymm0,ymm0
vaddps ymm0,ymm1,ymm0
And now the program seg-faults, since a dedicated aligned load instruction is used, but the memory address is not aligned. (The program doesn't seg-fault if I use _mm256_loadu_ps, or if I set OFFSET to 0, by the way.)
This leaves the programmer at the mercy of the compiler and makes the behavior partly unpredictable, in my humble opinion.
My question is: Is there a way to force the C compiler to either generate a direct load in a processing instruction (such as vaddps) or to generate a dedicated load instruction (such as vmovaps)?
There is no way to explicitly control folding of loads with intrinsics. I consider this a weakness of intrinsics. If you want to explicitly control the folding then you have to use assembly.
In previous versions of GCC I was able to control the folding to some degree using an aligned or unaligned load. However, that no longer appears to be the case (GCC 4.9.2). For example, in the function AddDot4x4_vec_block_8wide here the loads are folded:
vmulps ymm9, ymm0, YMMWORD PTR [rax-256]
vaddps ymm8, ymm9, ymm8
However, in a previous version of GCC the loads were not folded:
vmovups ymm9, YMMWORD PTR [rax-256]
vmulps ymm9, ymm0, ymm9
vaddps ymm8, ymm8, ymm9
The correct solution is, obviously, to only use aligned loads when you know the data is aligned, and if you really want to explicitly control the folding, use assembly.
In addition to Z boson's answer, I can add that the problem can also be caused by the compiler assuming the memory region is aligned (because of the __attribute__ ((aligned(32))) marking on the array). At runtime that attribute may not hold for values on the stack, since the stack is only guaranteed to be 16-byte aligned (see this bug, which is still open at the time of this writing, though some fixes have made it into gcc 4.6). The compiler is within its rights to choose the instructions that implement the intrinsics, so it may or may not fold the memory load into the computational instruction, and it is also within its rights to use vmovaps when the folding does not occur (because, as noted before, the memory region is supposed to be aligned).
You can try forcing the compiler to realign the stack to 32 bytes upon entry in main by specifying -mstackrealign and -mpreferred-stack-boundary=5 (see here) but it will incur a performance overhead.
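If the goal is simply code that cannot fault regardless of what the compiler decides to fold, here is a sketch of the question's loop using the unaligned-load intrinsic (the function name sum_floats is made up):
#include <immintrin.h>

// _mm256_loadu_ps never requires alignment, so this is safe whether GCC folds
// the load into vaddps or emits a standalone vmovups.
static __m256 sum_floats(const float *data, long size, long offset)
{
    __m256 sum = _mm256_setzero_ps();
    for (const float *d = data + offset; d < data + size - 8; d += 8)
        sum = _mm256_add_ps(sum, _mm256_loadu_ps(d));
    return sum;
}
On AVX-capable CPUs, unaligned loads (folded or not) usually run at the same speed as the aligned forms when the data happens to be aligned, so this typically costs nothing.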

Checking if TWO SSE registers are not both zero without destroying them

I want to test if two SSE registers are not both zero without destroying them.
This is the code I currently have:
uint8_t *src; // Assume it is initialized and 16-byte aligned
__m128i xmm0, xmm1, xmm2;
xmm0 = _mm_load_si128((__m128i const*)&src[i]); // Need to preserve xmm0 & xmm1
xmm1 = _mm_load_si128((__m128i const*)&src[i+16]);
xmm2 = _mm_or_si128(xmm0, xmm1);
if (!_mm_testz_si128(xmm2, xmm2)) { // Test both are not zero
}
Is this the best way (using up to SSE 4.2)?
I learned something useful from this question. Let's first look at some scalar code
extern foo2(int x, int y);
void foo(int x, int y) {
if((x || y)!=0) foo2(x,y);
}
Compile this with gcc -O3 -S -masm=intel test.c and the important assembly is
mov eax, edi ; edi = x, esi = y -> copy x into eax
or eax, esi ; eax = x | y and set zero flag in FLAGS if zero
jne .L4 ; jump not zero
Now let's look at testing SIMD registers for zero. Unlike scalar code there is no SIMD FLAGS register. However, with SSE4.1 there are SIMD test instructions which can set the zero flag (and carry flag) in the scalar FLAGS register.
extern foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
__m128i z = _mm_or_si128(x,y);
if (!_mm_testz_si128(z,z)) foo2(x,y);
}
Compile with c99 -msse4.1 -O3 -masm=intel -S test_SSE.c and the important assembly is
movdqa xmm2, xmm0 ; xmm0 = x, xmm1 = y, copy x into xmm2
por xmm2, xmm1 ; xmm2 = x | y
ptest xmm2, xmm2 ; set zero flag if zero
jne .L4 ; jump not zero
Notice that this takes one more instruction, because the packed bit-wise OR does not set the zero flag. Notice also that both the scalar version and the SIMD version need an additional register (eax in the scalar case and xmm2 in the SIMD case). So to answer your question, your current solution is the best you can do.
However, if you do not have a processor with SSE4.1 or better, an alternative that only needs SSE2 is _mm_movemask_epi8:
extern foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
if (_mm_movemask_epi8(_mm_or_si128(x,y))) foo2(x,y);
}
The important assembly is
movdqa xmm2, xmm0
por xmm2, xmm1
pmovmskb eax, xmm2
test eax, eax
jne .L4
Notice that this needs one more instruction than with the SSE4.1 ptest instruction.
Until now I have been using the pmovmskb instruction because its latency is better than that of ptest on pre-Sandy Bridge processors. However, that reasoning predates Haswell: on Haswell the latency of pmovmskb is worse than the latency of ptest. They both have the same throughput, though in this case that is not really important. What's important (which I did not realize before) is that pmovmskb does not set the FLAGS register, so it requires another instruction. So now I'll be using ptest in my critical loop. Thank you for your question.
Edit: as suggested by the OP there is a way this can be done without using another SSE register.
extern foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
if (_mm_movemask_epi8(x) | _mm_movemask_epi8(y)) foo2(x,y);
}
The relevant assembly from GCC is:
pmovmskb eax, xmm0
pmovmskb edx, xmm1
or edx, eax
jne .L4
Instead of using another xmm register this uses two scalar registers.
Note that fewer instructions does not necessarily mean better performance. Which of these solutions is best? You have to test each of them to find out.
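One way to keep both variants around and let the target decide at compile time (just a sketch; the function name is made up):
#include <x86intrin.h>

// Returns non-zero when x and y are not both zero, preserving both inputs.
static inline int not_both_zero(__m128i x, __m128i y)
{
#ifdef __SSE4_1__
    __m128i z = _mm_or_si128(x, y);
    return !_mm_testz_si128(z, z);                        // ptest sets ZF when z is all zero
#else
    return _mm_movemask_epi8(_mm_or_si128(x, y)) != 0;    // SSE2 fallback
#endif
}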
If you use C/C++, you cannot control the individual CPU registers. If you want full control, you must use assembler.

What's the difference between GCC builtin vectorization types and C arrays?

I have three functions a(), b() and c() that are supposed to do the same thing:
typedef float Builtin __attribute__ ((vector_size (16)));
typedef struct {
float values[4];
} Struct;
typedef union {
Builtin b;
Struct s;
} Union;
extern void printv(Builtin);
extern void printv(Union);
extern void printv(Struct);
int a() {
Builtin m = { 1.0, 2.0, 3.0, 4.0 };
printv(m);
}
int b() {
Union m = { 1.0, 2.0, 3.0, 4.0 };
printv(m);
}
int c() {
Struct m = { 1.0, 2.0, 3.0, 4.0 };
printv(m);
}
When I compile this code I observe the following behaviour:
When calling printv() in a(), all 4 floats are passed in %xmm0. No writes to memory occur.
When calling printv() in b(), 2 floats are passed in %xmm0 and the other two in %xmm1. To accomplish this, the 4 floats are loaded (from .LC0) into %xmm2 and stored from there to memory. After that, 2 floats are read back from that memory into %xmm0 and the other 2 floats are loaded (from .LC1) into %xmm1.
I'm a bit lost on what c() actually does.
Why are a(), b() and c() different?
Here is the assembly output for a():
vmovaps .LC0(%rip), %xmm0
call _Z6printvU8__vectorf
The assembly output for b():
vmovaps .LC0(%rip), %xmm2
vmovaps %xmm2, (%rsp)
vmovq .LC1(%rip), %xmm1
vmovq (%rsp), %xmm0
call _Z6printv5Union
And the assembly output for c():
andq $-32, %rsp
subq $32, %rsp
vmovaps .LC0(%rip), %xmm0
vmovaps %xmm0, (%rsp)
vmovq .LC2(%rip), %xmm0
vmovq 8(%rsp), %xmm1
call _Z6printv6Struct
The data:
.section .rodata.cst16,"aM",#progbits,16
.align 16
.LC0:
.long 1065353216
.long 1073741824
.long 1077936128
.long 1082130432
.section .rodata.cst8,"aM",#progbits,8
.align 8
.LC1:
.quad 4647714816524288000
.align 8
.LC2:
.quad 4611686019492741120
The quad 4647714816524288000 seems to be nothing more than the floats 3.0 and 4.0 in adjacent long words.
Nice question, I had to dig a little because I never used SSE (in this case SSE2) myself. Essentially, vector instructions operate on multiple values stored in one register, i.e. the XMM registers. In C the data type float uses IEEE 754 and is therefore 32 bits wide, so four floats form a 128-bit vector, which is exactly the width of an XMM register. The registers look like this:
XMM0 (SSE): |----------------| 128 bits
YMM0 (AVX): |----------------|----------------| 256 bits
In your first case a() you use the SIMD vectorization with
typedef float Builtin __attribute__ ((vector_size (16)));
which allows the entire vector to be moved into the XMM0 register in one go. Now in your second case b() you use a union. Because the union also contains a plain float array (inside Struct), the calling convention classifies it per 8-byte chunk rather than as one 128-bit vector argument, so it cannot be handed over in a single register. This leads to the following behavior:
The data from .LC0 is loaded into XMM2 with:
vmovaps .LC0(%rip), %xmm2
but because the union can also be viewed as a structure of four floats, it has to be handed over as two independent 64-bit chunks. Each chunk still travels in an XMM register, but only the low 64 bits of each register are used, so the 128-bit value ends up split across two registers instead of occupying one. This is done in the following.
The vector in XMM2 is stored to memory with
vmovaps %xmm2, (%rsp)
now the upper 64 bits of the vector (bits 64-127), i.e. the floats 3.0 and 4.0, are loaded (vmovq moves a quadword, i.e. 64 bits) from .LC1 into XMM1 with
vmovq .LC1(%rip), %xmm1
and finally the lower 64 bits of the vector (bits 0-63), i.e. the floats 1.0 and 2.0, are read back from memory into XMM0 with
vmovq (%rsp), %xmm0
Now you have the upper and the lower half of the 128-bit vector in two separate XMM registers.
Now in case c() I'm not quite sure either, but here it goes. First %rsp is aligned down to a 32-byte boundary and then 32 bytes are subtracted to make room for the data on the stack (which keeps the 32-byte alignment); this is done with
andq $-32, %rsp
subq $32, %rsp
now this time the vector is loaded into XMM0 and then stored on the stack with
vmovaps .LC0(%rip), %xmm0
vmovaps %xmm0, (%rsp)
and finally the lower 64 bits of the vector (floats 1.0 and 2.0, from .LC2) end up in XMM0, while the upper 64 bits (floats 3.0 and 4.0, read back from 8(%rsp)) end up in XMM1 with
vmovq .LC2(%rip), %xmm0
vmovq 8(%rsp), %xmm1
In all three cases the vectorization is treated differently. Hope this helps.
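If you control printv's signature, one way to avoid the split (a sketch based on the calling-convention discussion at the top of this page; the names printv_vec and VecView are mine) is to pass the builtin vector itself across the call and create the union view only inside the callee:
typedef float Builtin __attribute__ ((vector_size (16)));

union VecView
{
    Builtin b;
    float values[4];
};

void printv_vec(Builtin v)   // a bare Builtin argument arrives in one %xmm register
{
    union VecView u;
    u.b = v;                 // reinterpret locally; no stack round-trip is forced by the ABI
    // ... read u.values[0..3] here ...
}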

cmpxchg example for 64 bit integer

I am using cmpxchg (compare-and-exchange) on the i686 architecture for a 32-bit compare and swap, as follows.
(Editor's note: the original 32-bit example was buggy, but the question isn't about it. I believe this version is safe, and as a bonus compiles correctly for x86-64 as well. Also note that inline asm isn't needed or recommended for this; __atomic_compare_exchange_n or the older __sync_bool_compare_and_swap work for int32_t or int64_t on i486 and x86-64. But this question is about doing it with inline asm, in case you still want to.)
// note that this function doesn't return the updated oldVal
static int CAS(int *ptr, int oldVal, int newVal)
{
unsigned char ret;
__asm__ __volatile__ (
" lock\n"
" cmpxchgl %[newval], %[mem]\n"
" sete %0\n"
: "=q" (ret), [mem] "+m" (*ptr), "+a" (oldVal)
: [newval]"r" (newVal)
: "memory"); // barrier for compiler reordering around this
return ret; // ZF result, 1 on success else 0
}
What is the equivalent for the x86_64 architecture, for a 64-bit compare and swap?
static int CAS(long *ptr, long oldVal, long newVal)
{
unsigned char ret;
// ?
return ret;
}
The x86_64 instruction set has the cmpxchgq (q for quadword) instruction for 8-byte (64 bit) compare and swap.
There's also a cmpxchg8b instruction which will work on 8-byte quantities but it's more complex to set up, needing you to use edx:eax and ecx:ebx rather than the more natural 64-bit rax. The reason this exists almost certainly has to do with the fact Intel needed 64-bit compare-and-swap operations long before x86_64 came along. It still exists in 64-bit mode, but is no longer the only option.
But, as stated, cmpxchgq is probably the better option for 64-bit code.
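For completeness, here is a sketch of the 64-bit version using the same inline-asm pattern as the 32-bit CAS in the question (only the operand-size suffix changes; like the original, it returns just a success flag):
#include <stdint.h>

static int CAS64(int64_t *ptr, int64_t oldVal, int64_t newVal)
{
    unsigned char ret;
    __asm__ __volatile__ (
        " lock\n"
        " cmpxchgq %[newval], %[mem]\n"
        " sete %0\n"
        : "=q" (ret), [mem] "+m" (*ptr), "+a" (oldVal)
        : [newval] "r" (newVal)
        : "memory");               // compiler barrier, as in the 32-bit version
    return ret;                    // 1 if the swap happened (ZF set), else 0
}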
If you need to cmpxchg a 16 byte object, the 64-bit version of cmpxchg8b is cmpxchg16b. It was missing from the very earliest AMD64 CPUs, so compilers won't generate it for std::atomic::compare_exchange on 16B objects unless you enable -mcx16 (for gcc). Assemblers will assemble it, though, but beware that your binary won't run on the earliest K8 CPUs. (This only applies to cmpxchg16b, not to cmpxchg8b in 64-bit mode, or to cmpxchgq).
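And a sketch of a 16-byte CAS without writing the asm yourself, assuming GCC and a CPU that has cmpxchg16b (build with -mcx16 so the builtin can expand inline instead of going through a library call):
// unsigned __int128 is a GCC extension; with -mcx16 the __sync builtin on a
// 16-byte object can expand to lock cmpxchg16b.
static int CAS128(unsigned __int128 *ptr,
                  unsigned __int128 oldVal, unsigned __int128 newVal)
{
    return __sync_bool_compare_and_swap(ptr, oldVal, newVal);
}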
cmpxchg8b
__forceinline int64_t interlockedCompareExchange(volatile int64_t & v,int64_t exValue,int64_t cmpValue)
{
__asm {
mov esi,v
mov ebx,dword ptr exValue
mov ecx,dword ptr exValue + 4
mov eax,dword ptr cmpValue
mov edx,dword ptr cmpValue + 4
lock cmpxchg8b qword ptr [esi]
}
}
The x64 architecture supports a 64-bit compare-exchange using the good, old cmpxchg instruction. Or you could also use the somewhat more complicated cmpxchg8b instruction (from the "AMD64 Architecture Programmer's Manual Volume 1: Application Programming"):
The CMPXCHG instruction compares a value in the AL or rAX register with the first (destination) operand, and sets the arithmetic flags (ZF, OF, SF, AF, CF, PF) according to the result. If the compared values are equal, the source operand is loaded into the destination operand. If they are not equal, the first operand is loaded into the accumulator. CMPXCHG can be used to try to intercept a semaphore, i.e. test if its state is free, and if so, load a new value into the semaphore, making its state busy. The test and load are performed atomically, so that concurrent processes or threads which use the semaphore to access a shared object will not conflict.
The CMPXCHG8B instruction compares the 64-bit values in the EDX:EAX registers with a 64-bit memory location. If the values are equal, the zero flag (ZF) is set, and the ECX:EBX value is copied to the memory location. Otherwise, the ZF flag is cleared, and the memory value is copied to EDX:EAX.
The CMPXCHG16B instruction compares the 128-bit value in the RDX:RAX and RCX:RBX registers with a 128-bit memory location. If the values are equal, the zero flag (ZF) is set, and the RCX:RBX value is copied to the memory location. Otherwise, the ZF flag is cleared, and the memory value is copied to rDX:rAX.
Different assembler syntaxes may need to have the length of the operations specified in the instruction mnemonic if the size of the operands can't be inferred. This may be the case for GCC's inline assembler - I don't know.
Usage of cmpxchg8b, from the AMD64 Architecture Programmer's Manual Volume 3:
Compare EDX:EAX register to 64-bit memory location. If equal, set the zero flag (ZF) to 1 and copy the ECX:EBX register to the memory location. Otherwise,
copy the memory location to EDX:EAX and clear the zero flag.
I use cmpxchg8b to implement a simple mutex lock function on an x86-64 machine. Here is the code:
.text
.align 8
.global mutex_lock
mutex_lock:
pushq %rbp
movq %rsp, %rbp
pushq %rbx              # rbx is callee-saved in the SysV ABI, so preserve it
.L1:
movl $0, %edx           # expected value in edx:eax = 0 (unlocked)
movl $0, %eax
movl $0, %ecx           # new value in ecx:ebx = 1 (locked)
movl $1, %ebx
lock cmpxchg8b (%rdi)   # if *(rdi) == edx:eax, store ecx:ebx; else reload edx:eax
jne .L1                 # ZF clear -> lock was already taken, retry
popq %rbx
popq %rbp
ret
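A C-level sketch of the same spin lock, using the builtin mentioned in the editor's note at the top (this lets the compiler pick lock cmpxchg itself and spares you the rbx bookkeeping; the function name is mine):
#include <stdint.h>

static void mutex_lock_c(volatile int64_t *m)
{
    int64_t expected = 0;
    // try to change *m from 0 (unlocked) to 1 (locked) until it succeeds
    while (!__atomic_compare_exchange_n(m, &expected, 1, 0,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
        expected = 0;              // the builtin overwrites expected on failure
        __builtin_ia32_pause();    // spin-loop hint for the sibling hyperthread
    }
}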

Resources