Aligned and unaligned memory access with AVX/AVX2 intrinsics - gcc

According to Intel's Software Developer Manual (sec. 14.9), AVX relaxed the alignment requirements of memory accesses. If data is loaded directly in a processing instruction, e.g.
vaddps ymm0,ymm0,YMMWORD PTR [rax]
the load address doesn't have to be aligned. However, if a dedicated aligned load instruction is used, such as
vmovaps ymm0,YMMWORD PTR [rax]
the load address has to be aligned (to multiples of 32), otherwise an exception is raised.
What confuses me is the automatic code generation from intrinsics, in my case by gcc/g++ (4.6.3, Linux). Please have a look at the following test code:
#include <x86intrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#define SIZE (1L << 26)
#define OFFSET 1
int main() {
float *data;
assert(!posix_memalign((void**)&data, 32, SIZE*sizeof(float)));
for (unsigned i = 0; i < SIZE; i++) data[i] = drand48();
float res[8] __attribute__ ((aligned(32)));
__m256 sum = _mm256_setzero_ps(), elem;
for (float *d = data + OFFSET; d < data + SIZE - 8; d += 8) {
elem = _mm256_load_ps(d);
// sum = _mm256_add_ps(elem, elem);
sum = _mm256_add_ps(sum, elem);
}
_mm256_store_ps(res, sum);
for (int i = 0; i < 8; i++) printf("%g ", res[i]); printf("\n");
return 0;
}
(Yes, I know the code is faulty, since I use an aligned load on unaligned addresses, but bear with me...)
I compile the code with
g++ -Wall -O3 -march=native -o memtest memtest.C
on a CPU with AVX. If I check the code generated by g++ by using
objdump -S -M intel-mnemonic memtest | more
I see that the compiler does not generate an aligned load instruction, but loads the data directly in the vector addition instruction:
vaddps ymm0,ymm0,YMMWORD PTR [rax]
The code executes without any problem, even though the memory addresses are not aligned (OFFSET is 1). This is clear since vaddps tolerates unaligned addresses.
If I uncomment the line with the second addition intrinsic, the compiler cannot fuse the load and the addition since vaddps can only have a single memory source operand, and generates:
vmovaps ymm0,YMMWORD PTR [rax]
vaddps ymm1,ymm0,ymm0
vaddps ymm0,ymm1,ymm0
And now the program seg-faults, since a dedicated aligned load instruction is used, but the memory address is not aligned. (The program doesn't seg-fault if I use _mm256_loadu_ps, or if I set OFFSET to 0, by the way.)
This leaves the programmer at the mercy of the compiler and makes the behavior partly unpredictable, in my humble opinion.
My question is: Is there a way to force the C compiler to either generate a direct load in a processing instruction (such as vaddps) or to generate a dedicated load instruction (such as vmovaps)?

There is no way to explicitly control folding of loads with intrinsics. I consider this a weakness of intrinsics. If you want to explicitly control the folding then you have to use assembly.
In previous versions of GCC I was able to control the folding to some degree by using an aligned or unaligned load. However, that no longer appears to be the case (GCC 4.9.2). For example, in the function AddDot4x4_vec_block_8wide here, the loads are folded:
vmulps ymm9, ymm0, YMMWORD PTR [rax-256]
vaddps ymm8, ymm9, ymm8
However, in a previous version of GCC the loads were not folded:
vmovups ymm9, YMMWORD PTR [rax-256]
vmulps ymm9, ymm0, ymm9
vaddps ymm8, ymm8, ymm9
The correct solution is, obviously, to use aligned loads only when you know the data is aligned, and, if you really want to explicitly control the folding, to use assembly.
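If the real requirement is just correctness with unaligned pointers, the portable fix is _mm256_loadu_ps; the compiler may still fold it into vaddps, which is fine, because the folded form tolerates unaligned addresses. If you really want to pin down the instruction, a GNU extended-asm wrapper is one way. A minimal sketch, assuming an AVX-enabled build and GCC's default AT&T syntax (load256_forced is just an illustrative helper name, not anything the question's code uses):
#include <x86intrin.h>
/* Force a standalone vmovups that the compiler can neither fold into the
   consuming instruction nor "upgrade" to vmovaps. */
static inline __m256 load256_forced(const float *d)
{
    __m256 v;
    __asm__ ("vmovups %1, %0"
             : "=x" (v)
             : "m" (*(const float (*)[8])d));   /* declares a 32-byte read at d */
    return v;
}
In the loop from the question, elem = load256_forced(d); then behaves the same whether or not a second addition forces the load out of vaddps.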

In addition to Z boson's answer: the problem can also be caused by the compiler assuming the memory region is aligned (because __attribute__ ((aligned(32))) marks the array). At runtime that attribute may not hold for values on the stack, because the stack is only 16-byte aligned (see this bug, which is still open at the time of this writing, though some fixes have made it into gcc 4.6). The compiler is within its rights to choose the instructions that implement an intrinsic, so it may or may not fold the memory load into the computational instruction, and it is also within its rights to use vmovaps when the folding does not occur (because, as noted before, the memory region is supposed to be aligned).
You can try forcing the compiler to realign the stack to 32 bytes upon entry to main by specifying -mstackrealign and -mpreferred-stack-boundary=5 (see here), but it will incur a performance overhead.
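For reference, applied to the compile line from the question, that would be (the flags are the ones named above; the exact overhead will vary):
g++ -Wall -O3 -march=native -mstackrealign -mpreferred-stack-boundary=5 -o memtest memtest.C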

Related

Why is my SSE assembly slower in release builds?

I've been playing around with some x64 assembly and the XMM registers to do some float math, and I'm seeing some performance that is puzzling me.
As a self-learning exercise, I wrote some SSE assembly to approximate the 'sin' function (using the Taylor series), and called this from some basic C++ in a loop to compare to the standard library version. Code is below, and I've pasted the output for some typical runs after that. (I'm not looking for a critique of the code or approach here, just trying to understand the perf numbers).
What I don't get is why, with a "Release" build, where the actual running assembly is identical (I've stepped through the debugger to double-check), it is consistently about 40-50 cycles slower. (Uncommenting the LFENCE instructions adds about 100 cycles to both Debug and Release, so the delta remains the same.) As a bonus question, why is the very first iteration typically in the thousands?!
I get this stuff is very complex and subtly impacted by numerous factors, but everything that pops in my head as a potential cause here just doesn't make sense.
I've checked the MXCSR flags in both runs, and this is identical across builds also (with the default value of 1f80h which has all exceptions masked).
Any idea what would cause this? What further analysis could I do to figure this out at an even deeper level?
Assembly
_RDATA segment
pi real4 3.141592654
rf3 real4 0.1666666667
rf5 real4 0.008333333333
rf7 real4 0.0001984126984
_RDATA ends
_TEXT segment
; float CalcSin(float rads, int* cycles)
CalcSin PROC
; "leaf" function - doesn't use the stack or any non-volatile registers
mov r8, rdx ; Save the 'cycles' pointer into R8
rdtsc ; Get current CPU cyles in EDX:EAX
; lfence ; Ensure timer is taken before executing the below
mov ecx, eax ; Save the low 32 bits of the timer into ECX
movss xmm2, xmm0
mulss xmm2, xmm2 ; X^2
movss xmm3, xmm0
mulss xmm3, xmm2 ; x^3
movss xmm4, rf3 ; 1/3!
mulss xmm4, xmm3 ; x^3 / 3!
subss xmm0, xmm4 ; x - x^3 / 3!
mulss xmm3, xmm2 ; x^5
movss xmm4, rf5 ; 1/5!
mulss xmm4, xmm3 ; x^5 / 5!
addss xmm0, xmm4 ; x - x^3 / 3! + x^5 / 5!
mulss xmm3, xmm2 ; x^7
movss xmm4, rf7 ; 1/7!
mulss xmm4, xmm3 ; x^7 / 7!
subss xmm0, xmm4 ; x - x^3 / 3! + x^5 / 5! - x^7 / 7!
; lfence ; Ensure above completes before taking the timer again
rdtsc ; Get the timer now
sub eax, ecx ; Get the difference in cycles
mov dword ptr [r8], eax
ret
CalcSin ENDP
_TEXT ends
END
C++
#include <stdio.h>
#include <math.h>
#include <vector>
const float PI = 3.141592654f;
extern "C" float CalcSin(float rads, int* cycles);
void DoCalcs(float rads) {
int cycles;
float result = CalcSin(rads, &cycles);
printf("Sin(%.8f) = %.8f. Took %d cycles\n", rads, result, cycles);
printf("C library = %.8f\n", sin(rads));
}
int main(int argc, char* argv[]) {
std::vector<float> inputs{PI / 1000, PI / 2 - PI / 1000, PI / 4, 0.0001f, PI / 2};
for (auto val : inputs) {
DoCalcs(val);
}
return 0;
}
With a "Debug" build (I'm using Visual Studio 2019), I typically see the below timing reported:
Sin(0.00314159) = 0.00314159. Took 3816 cycles
C library = 0.00314159
Sin(1.56765473) = 0.99984086. Took 18 cycles
C library = 0.99999507
Sin(0.78539819) = 0.70710647. Took 18 cycles
C library = 0.70710680
Sin(0.00010000) = 0.00010000. Took 18 cycles
C library = 0.00010000
Sin(1.57079637) = 0.99984306. Took 18 cycles
C library = 1.00000000
The exact same code with a "Release" build, I typically see the below:
Sin(0.00314159) = 0.00314159. Took 4426 cycles
C library = 0.00314159
Sin(1.56765473) = 0.99984086. Took 70 cycles
C library = 0.99999507
Sin(0.78539819) = 0.70710647. Took 62 cycles
C library = 0.70710680
Sin(0.00010000) = 0.00010000. Took 64 cycles
C library = 0.00010000
Sin(1.57079637) = 0.99984306. Took 62 cycles
C library = 1.00000000
====UPDATE 1====
I changed the code to load the constants as immediates instead of referencing the .rdata segment, as Peter mentioned, and this got rid of the slow first iteration, i.e. I replaced the commented-out line with the two lines following it:
; movss xmm4, rf5 ; 1/5!
mov eax, 3C088889h ; 1/5! float representation
movd xmm4, eax
Warming up the CPU didn't help, but I did notice the first iteration in Release was now just as fast as Debug, and the rest were still slow. As the printf isn't called until after the first calculation, I wondered if this had an impact. I changed the code to just store the results as it ran and print them once complete, and now Release is just as fast, i.e.
Updated C++ code
extern "C" float CalcSin(float rads, int* cycles);
std::vector<float> values;
std::vector<int> rdtsc;
void DoCalcs(float rads) {
int cycles;
float result = CalcSin(rads, &cycles);
values.push_back(result);
rdtsc.push_back(cycles);
// printf("Sin(%.8f) = %.8f. Took %d cycles\n", rads, result, cycles);
// printf("C library = %.8f\n", sin(rads));
}
int main(int argc, char* argv[]) {
std::vector<float> inputs{PI / 1000, PI / 2 - PI / 1000, PI / 4, 0.0001f, PI / 2};
for (auto val : inputs) {
DoCalcs(val);
}
auto cycle_iter = rdtsc.begin();
auto value_iter = values.begin();
for (auto& input : inputs) {
printf("Sin(%.8f) = %.8f. Took %d cycles\n", input, *value_iter++, *cycle_iter++);
printf("C library = %.8f\n", sin(input));
}
return 0;
}
And now Release is pretty much identical to debug, i.e. around 18 - 24 cycles consistently on each call.
I'm not sure what the printf call is doing in Release builds, or maybe it's the way it was linked/optimized with the Release settings, but it's strange that it negatively impacted the separate, identical assembly calls the way it did.
Sin(0.00314159) = 0.00314159. Took 18 cycles
C library = 0.00314159
Sin(1.56765473) = 0.99984086. Took 18 cycles
C library = 0.99999507
Sin(0.78539819) = 0.70710647. Took 24 cycles
C library = 0.70710680
Sin(0.00010000) = 0.00010000. Took 20 cycles
C library = 0.00010000
Sin(1.57079637) = 0.99984306. Took 24 cycles
C library = 1.00000000
====UPDATE 2====
To rule out CPU frequency ramp-up/down, I went in and tweaked a few BIOS settings (disabled Turbo, set a consistent core voltage, etc.), and can now see via the ASUS "AI Suite" app for the motherboard that the CPU runs at a consistent 3600MHz. (I'm running an Intel Core i9-9900K @ 3.6GHz on Windows 10 x64.)
After setting that... still no change.
The next thing that occurred to me is that with the printf I have a call out to the C runtime library between each iteration, which is a different DLL between Debug and Release builds. To remove any other variation I started building from the command line instead of VS. Compiling with maximum speed optimizations and the release CRT DLLs (/O2 and /MD respectively), I still see the same slow-down. Switching to the debug CRT DLLs, I see some improvement. If I switch to statically linking the CRT, then it doesn't matter whether I use the debug or release versions, or whether I compile with optimizations or not; I regularly see 24 cycles per call, i.e.
ml64 /c ..\x64simd.asm
cl.exe /Od /MT /Feapp.exe ..\main.cpp x64simd.obj
>app.exe
Sin(0.00314159) = 0.00314159. Took 24 cycles
Sin(1.56765473) = 0.99984086. Took 24 cycles
Sin(0.78539819) = 0.70710647. Took 24 cycles
Sin(0.00010000) = 0.00010000. Took 24 cycles
Sin(1.57079637) = 0.99984306. Took 24 cycles
So it's definitely something in calling out to the CRT Release DLLs causing the slow-down. I'm still puzzled as to why, especially as the Debug build in VS is also using CRT via DLLs.
You're timing in reference cycles with rdtsc, not core clock cycles. It's probably the same speed both times, in core clock cycles, but with the CPU running at different frequencies.
Probably a debug build gives the CPU time to ramp up to max turbo (more core cycles per reference cycle) before your function gets called. Because the calling code compiles to slower asm. And especially with MSVC, a debug build adds extra stuff like poisoning the stack frame to catch use of uninitialized vars. And also overhead for incremental linking.
None of this slows down your hand-written function itself; it's just "warm-up" that you neglected to do manually in your microbenchmark.
See How to get the CPU cycle count in x86_64 from C++? for lots more details about RDTSC.
A factor of ~3 between idle CPU clock and max turbo (or some higher clock) is very plausible for modern x86 CPUs. My i7-6700k idles at 0.8GHz with a rated frequency of 4.0GHz and a max single-core turbo of 4.2GHz. But many laptop CPUs have a much lower non-turbo max (and might only ramp to non-turbo initially, not max turbo right away, depending on the energy_performance_preference HW governor, or especially the software governor on older CPUs).
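The benchmarking-side fix is to burn some dummy work before the first timed call so the core is already at its steady-state frequency. A minimal warm-up sketch (warm_up_cpu and the iteration count are arbitrary; the volatile accumulator only exists so the loop isn't optimized away):
#include <stdint.h>
volatile uint64_t warmup_sink = 0;
// Spin for a fraction of a second so the core has ramped up to its sustained
// frequency before the first RDTSC measurement; tune the count as needed.
void warm_up_cpu() {
    for (uint64_t i = 0; i < 100000000; ++i)
        warmup_sink += i;
}
Calling it once at the top of main, before the first DoCalcs, is enough.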
As a bonus question, why is the very first iteration typically in the thousands!!
Probably dTLB miss and cache miss for loading rf3 from data memory. You could try loading those from C (by declaring extern volatile float rf3) to prime the TLB + cache for that block of constants, assuming they're all in the same cache line.
Possibly also an I-cache miss after the rdtsc, but the first load is probably before the end of an I-cache line so those could happen in parallel. (Putting the rdtsc inside your asm function means we probably aren't waiting for an iTLB miss or i-cache miss inside the timed region to even fetch the first byte of the function).
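A sketch of that priming idea; it assumes the constants are exported from the asm file (e.g. declared PUBLIC in the MASM source, which the listing above doesn't show) and that they share one cache line. prime_constants is a hypothetical helper:
// Only valid if the .asm file makes rf3/rf5/rf7 visible to the linker.
extern "C" volatile float rf3, rf5, rf7;
static void prime_constants() {
    // One dummy read pulls the constants' cache line and its dTLB entry in
    // before the first timed CalcSin call; volatile keeps the loads alive.
    volatile float sink = rf3 + rf5 + rf7;
    (void)sink;
}
Call prime_constants() once before the timed loop; if the miss theory is right, the first-iteration spike should shrink.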
Code review:
Don't use movss between XMM registers unless you want to blend the low 4 bytes into the old value of the destination. Use movaps xmm2, xmm0 to copy the whole register; it's much more efficient.
movaps can be handled by register renaming without needing any back-end execution unit, vs. movss only running on one execution unit in Intel CPUs, port 5. https://agner.org/optimize/. Also, movaps avoids a false dependency on the old value of the register because it overwrites the full reg, allowing out-of-order exec to work properly.
movss xmm, [mem] is fine, though: as a load it zero-extends into the full register.

How does GCC compile the 80 bit wide 10 byte float __float80 on x86_64?

According to one of the slides in the What's A Creel video "Modern x64 Assembly 4: Data Types" (link to the slide),
Note: real10 is only used with the x87 FPU, it is largely ignored nowadays but offers amazing precision!
He says,
"Real10 is only used with the x87 Floating Point Unit. [...] It's interesting the massive gain in precision that it offers you. You kind of take a performance hit with that gain because you can't use real10 with SSE, packed, SIMD style instructions. But, it's kind of interesting because if you want extra precision you can go to the x87 style FPU. Now a days it's almost never used at all."
However, I was googling and saw that GCC supports __float80 and __float128.
Is the __float80 in GCC calculated on the x87? Or it is using SIMD like the other float operations? What about __float128?
GCC docs for Additional Floating Types:
ISO/IEC TS 18661-3:2015 defines C support for additional floating types _Floatn and _Floatnx
... GCC does not currently support _Float128x on any systems.
I think _Float128 is IEEE binary128, i.e. a true 128-bit float with a huge exponent range; _Float128x would be an extended format with at least that much range and precision, which is presumably why no system provides it. See http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1691.pdf.
__float80 is obviously the x87 10-byte type. In the x86-64 SysV ABI, it's the same as long double; both have 16-byte alignment in that ABI.
__float80 is available on the i386, x86_64, and IA-64 targets, and supports the 80-bit (XFmode) floating type. It is an alias for the type name _Float64x on these targets.
__float128 on these targets is IEEE binary128 implemented in software: a true 128-bit format with a 15-bit exponent (the same range as __float80) and a 113-bit significand, so it gives more precision than __float80, at the cost of every operation going through a library call.
On i386, x86_64, and ..., __float128 is an alias for _Float128
float128 and double-double arithmetic
Optimize for fast multiplication but slow addition: FMA and doubledouble
double-double implementation resilient to FPU rounding mode
Those double-double tricks are a different thing from what gcc gives you with __float128, which is pure software floating point for the real 128-bit format (libgcc's __addtf3 and friends, with libquadmath for the math library).
Godbolt compiler explorer for gcc7.3 -O3 (same as gcc4.6, apparently these types aren't new)
//long double add_ld(long double x) { return x+x; } // same as __float80
__float80 add80(__float80 x) { return x+x; }
fld TBYTE PTR [rsp+8] # arg on the stack
fadd st, st(0)
ret # and returned in st(0)
__float128 add128(__float128 x) { return x+x; }
# IDK why not movapd or better movaps, silly compiler
movdqa xmm1, xmm0 # x arg in xmm0
sub rsp, 8 # align the stack
call __addtf3 # args in xmm0, xmm1
add rsp, 8
ret # return value in xmm0, I assume
int size80 = sizeof(__float80); // 16
int sizeld = sizeof(long double); // 16
int size128 = sizeof(__float128); // 16
So gcc calls a libgcc function for __float128 addition, not inlining an increment to the exponent or anything clever like that.
I found the answer here
__float80 is available on the i386, x86_64, and IA-64 targets, and supports the 80-bit (XFmode) floating type. It is an alias for the type name _Float64x on these targets.
Having looked up the XFmode,
“Extended Floating” mode represents an IEEE extended floating point number. This mode only has 80 meaningful bits (ten bytes). Some processors require such numbers to be padded to twelve bytes, others to sixteen; this mode is used for either.
Still not totally convinced, I compiled something simple
int main () {
__float80 a = 1.445839898;
return 1;
}
Using Radare I dumped it,
0x00000652 db2dc8000000 fld xword [0x00000720]
0x00000658 db7df0 fstp xword [local_10h]
I believe fld and fstp are part of the x87 instruction set, so it's true that the x87 is being used for the 10-byte __float80. However, for __float128 I'm getting:
0x000005fe 660f6f05aa00. movdqa xmm0, xmmword [0x000006b0]
0x00000606 0f2945f0 movaps xmmword [local_10h], xmm0
So we can see that here the 16-byte value is moved around in SSE (xmm) registers rather than on the x87.
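If you want to poke at __float128 from C, the arithmetic goes through the libgcc routines seen above (__addtf3 etc.), and printing/math functions live in libquadmath. A quick check, assuming gcc with libquadmath installed (link with -lquadmath; the file name is a placeholder):
#include <quadmath.h>
#include <stdio.h>
int main(void) {
    __float128 x = 1.0Q / 3.0Q;              // Q suffix: __float128 literal (GCC extension)
    char buf[128];
    // quadmath_snprintf understands the Q length modifier for __float128 values.
    quadmath_snprintf(buf, sizeof buf, "%.36Qg", x);
    printf("1/3 as binary128: %s\n", buf);   // roughly 33-34 significant decimal digits
    return 0;
}
// build: gcc thisfile.c -lquadmath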

What are the benefits of using vaddss instead of addss in scalar matrix addition?

I have implemented a scalar matrix addition kernel.
#include <stdio.h>
#include <time.h>
//#include <x86intrin.h>
//loops and iterations:
#define N 128
#define M N
#define NUM_LOOP 1000000
float __attribute__(( aligned(32))) A[N][M],
__attribute__(( aligned(32))) B[N][M],
__attribute__(( aligned(32))) C[N][M];
int main()
{
int w=0, i, j;
struct timespec tStart, tEnd;//used to record the processing time
double tTotal, tBest=10000;//the minimum of the total time will be assigned to the best time
do{
clock_gettime(CLOCK_MONOTONIC,&tStart);
for( i=0;i<N;i++){
for(j=0;j<M;j++){
C[i][j]= A[i][j] + B[i][j];
}
}
clock_gettime(CLOCK_MONOTONIC,&tEnd);
tTotal = (tEnd.tv_sec - tStart.tv_sec);
tTotal += (tEnd.tv_nsec - tStart.tv_nsec) / 1000000000.0;
if(tTotal<tBest)
tBest=tTotal;
} while(w++ < NUM_LOOP);
printf(" The best time: %lf sec in %d repetition for %dX%d matrix\n",tBest,w, N, M);
return 0;
}
In this case, I've compiled the program with different compiler flags, and the assembly output of the inner loop is as follows:
gcc -O2 -msse4.2: The best time: 0.000024 sec in 406490 repetition for 128X128 matrix
movss xmm1, DWORD PTR A[rcx+rax]
addss xmm1, DWORD PTR B[rcx+rax]
movss DWORD PTR C[rcx+rax], xmm1
gcc -O2 -mavx: The best time: 0.000009 sec in 1000001 repetition for 128X128 matrix
vmovss xmm1, DWORD PTR A[rcx+rax]
vaddss xmm1, xmm1, DWORD PTR B[rcx+rax]
vmovss DWORD PTR C[rcx+rax], xmm1
AVX version gcc -O2 -mavx:
__m256 vec256;
for(i=0;i<N;i++){
for(j=0;j<M;j+=8){
vec256 = _mm256_add_ps( _mm256_load_ps(&A[i][j]) , _mm256_load_ps(&B[i][j]));
_mm256_store_ps(&C[i][j], vec256);
}
}
SSE version gcc -O2 -msse4.2:
__m128 vec128;
for(i=0;i<N;i++){
for(j=0;j<M;j+=4){
vec128= _mm_add_ps( _mm_load_ps(&A[i][j]) , _mm_load_ps(&B[i][j]));
_mm_store_ps(&C[i][j], vec128);
}
}
In the scalar program the speedup of -mavx over -msse4.2 is 2.7x. I know AVX improved the ISA, and the gain might come from those improvements. But when I implemented the program with intrinsics for both AVX and SSE, the AVX version is 3x faster than the SSE version. The question is: how does it make sense that scalar AVX is 2.7x faster than scalar SSE, while the vectorized speedup is only 3x (the matrix size is 128x128 for this question)? The vectorized method ought to do much better than the scalar one, because I process eight elements per instruction with AVX compared to four with SSE. All programs have less than 4.5% cache misses, as reported by perf stat.
Using gcc -O2, Linux Mint, Skylake.
UPDATE: Briefly, scalar-AVX is 2.7x faster than scalar-SSE, but AVX-256 is only 3x faster than SSE-128 when vectorized. I think it might be because of pipelining: in scalar mode I have 3 vector ALUs that might not all be usable in vectorized mode. I might be comparing apples to oranges instead of apples to apples, and that might be why I cannot understand the result.
The problem you are observing is explained here. On Skylake systems, if the upper half of an AVX register is dirty, then there is a false dependency for non-VEX-encoded SSE operations on the upper half of the AVX register. In your case it seems there is a bug in your version of glibc 2.23. On my Skylake system with Ubuntu 16.10 and glibc 2.24 I don't have the problem. You can use
__asm__ __volatile__ ( "vzeroupper" : : : );
to clean the upper halves of the AVX registers. I don't think you can use an intrinsic such as _mm256_zeroupper to fix this, because GCC will say it's SSE code and not recognize the intrinsic. The option -mvzeroupper won't work either, because GCC once again thinks it's SSE code and will not emit the vzeroupper instruction.
BTW, it's Microsoft's fault that the hardware has this problem.
Update:
Other people are apparently encountering this problem on Skylake. It has been observed after printf, memset, and clock_gettime.
If your goal is to compare 128-bit operations with 256-bit operations, you could consider using -mprefer-avx128 -mavx (which is particularly useful on AMD). But then you would be comparing AVX256 vs AVX128 and not AVX256 vs SSE. AVX128 and SSE both use 128-bit operations, but their implementations are different. If you benchmark, you should mention which one you used.
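As an illustration of where the quoted asm statement would go in the benchmark from the question, here is a trimmed, self-contained sketch: the idea is simply to clear the dirty upper halves after anything that may have used AVX (glibc startup, printf, memset, clock_gettime) and before the timed non-VEX SSE loop runs.
#include <stdio.h>
#include <time.h>
#define N 128
#define M N
float __attribute__(( aligned(32))) A[N][M],
      __attribute__(( aligned(32))) B[N][M],
      __attribute__(( aligned(32))) C[N][M];
int main() {
    struct timespec tStart, tEnd;
    clock_gettime(CLOCK_MONOTONIC, &tStart);
    // Library code above may have left the YMM upper halves dirty; clear them
    // so the non-VEX SSE loop below doesn't pay the false-dependency penalty
    // on affected Skylake systems.
    __asm__ __volatile__ ( "vzeroupper" : : : );
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            C[i][j] = A[i][j] + B[i][j];
    clock_gettime(CLOCK_MONOTONIC, &tEnd);
    double t = (tEnd.tv_sec - tStart.tv_sec) + (tEnd.tv_nsec - tStart.tv_nsec) / 1e9;
    printf("one pass: %f sec\n", t);
    return 0;
}
// build as in the question, e.g.: gcc -O2 -msse4.2 thisfile.c (older glibc may also need -lrt)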

Predict cache line length using time measure

Hi, it's my first post here, so hi everyone.
I have a problem with predicting the cache line size in GNU AS. I wrote a program in C which calls a function written in assembly.
Here is the function:
.section .text
.section .data
.global time
time:
pushl %ebp
xor %edx, %edx
xor %eax, %eax
CPUID
RDTSC
popl %ebp
ret
It measures CPU cycles.
C code is:
#include <stdio.h>
const int size = 256;
void main(){
unsigned long long cykl, cykl1, cykl2;
unsigned char matrix[size];
char bla;
int i,j,k;
for(i=0 ; i<size; i++)
{
cykl1 = time();
bla = matrix[i];
cykl2 = time();
cykl = cykl2 - cykl1;
printf("i=%d: %lld \n",i, cykl);
}
}
I ran this program, but I can't see any time difference. As far as I know, my cache line length is 64 bytes.
The time should rise every time I load the next 64 bytes of the array, am I right?
I would be grateful for any advice on why it doesn't work properly.
I think there are 3 problems.
First, what your program calls might not be your assembly routine but the time(2) system call, which returns the current time in seconds.
On targets where C symbols get an underscore prefix, the assembly routine would have to be named _time in your *.s file; in any case, giving the routine a name that does not collide with the standard time function avoids the ambiguity.
You can also declare the routine with __asm__ keyword. See:
http://gcc.gnu.org/onlinedocs/gcc-4.8.0/gcc/Asm-Labels.html
Second, the memory access might be eliminated (or reordered) by the GCC optimizer.
You should check the assembly code generated from your C code.
Third, RDTSC is not serialized.
In other words, the CPU might reorder an access to memory and an RDTSC instruction, or execute them in parallel.
You should insert some instructions to prevent the reordering.
See the description of RDTSC in "Intel® 64 and IA-32 Architecture Software Developer's Manual Volume 2B: Instruction Set Reference, M-Z":
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
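Putting the three points together, one way to do the measurement is to use the compiler's own RDTSC intrinsic with fences around it, and to make the array volatile so the loads can't be optimized away. A sketch in C (lfence-based; the CPUID-before-RDTSC sequence from the manual works too; ticks is just an illustrative helper name):
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(), _mm_lfence() */
/* TSC read with fences on both sides so the measured load can neither start
   before the first read nor finish after the second one. */
static inline uint64_t ticks(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();   /* already combines EDX:EAX into 64 bits */
    _mm_lfence();
    return t;
}
int main(void)
{
    static volatile unsigned char matrix[256];   /* volatile: keep every load */
    for (int i = 0; i < 256; i++) {
        uint64_t t0 = ticks();
        (void)matrix[i];
        uint64_t t1 = ticks();
        printf("i=%d: %llu\n", i, (unsigned long long)(t1 - t0));
    }
    return 0;
}
Even then, expect the numbers to be dominated by the overhead of the timing itself; a single cached load costs only a few cycles, so per-line effects only stand out on accesses that genuinely miss.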

cmpxchg example for 64 bit integer

I am using cmpxchg (compare-and-exchange) on the i686 architecture for a 32-bit compare and swap, as follows.
(Editor's note: the original 32-bit example was buggy, but the question isn't about it. I believe this version is safe, and as a bonus compiles correctly for x86-64 as well. Also note that inline asm isn't needed or recommended for this; __atomic_compare_exchange_n or the older __sync_bool_compare_and_swap work for int32_t or int64_t on i486 and x86-64. But this question is about doing it with inline asm, in case you still want to.)
// note that this function doesn't return the updated oldVal
static int CAS(int *ptr, int oldVal, int newVal)
{
unsigned char ret;
__asm__ __volatile__ (
" lock\n"
" cmpxchgl %[newval], %[mem]\n"
" sete %0\n"
: "=q" (ret), [mem] "+m" (*ptr), "+a" (oldVal)
: [newval]"r" (newVal)
: "memory"); // barrier for compiler reordering around this
return ret; // ZF result, 1 on success else 0
}
What is the equivalent for the x86_64 architecture for a 64-bit compare and swap?
static int CAS(long *ptr, long oldVal, long newVal)
{
unsigned char ret;
// ?
return ret;
}
The x86_64 instruction set has the cmpxchgq (q for quadword) instruction for 8-byte (64 bit) compare and swap.
There's also a cmpxchg8b instruction which will work on 8-byte quantities but it's more complex to set up, needing you to use edx:eax and ecx:ebx rather than the more natural 64-bit rax. The reason this exists almost certainly has to do with the fact Intel needed 64-bit compare-and-swap operations long before x86_64 came along. It still exists in 64-bit mode, but is no longer the only option.
But, as stated, cmpxchgq is probably the better option for 64-bit code.
If you need to cmpxchg a 16 byte object, the 64-bit version of cmpxchg8b is cmpxchg16b. It was missing from the very earliest AMD64 CPUs, so compilers won't generate it for std::atomic::compare_exchange on 16B objects unless you enable -mcx16 (for gcc). Assemblers will assemble it, though, but beware that your binary won't run on the earliest K8 CPUs. (This only applies to cmpxchg16b, not to cmpxchg8b in 64-bit mode, or to cmpxchgq).
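To answer the question as asked, the 64-bit version of the inline-asm CAS is a near-mechanical edit of the 32-bit one: widen the types and use the q suffix so cmpxchgq and RAX are used. A sketch along the same lines (CAS64 is just an illustrative name, and, as the editor's note says, the __atomic/__sync built-ins are the preferable way to get this):
#include <stdint.h>
// 64-bit compare-and-swap: returns 1 on success, 0 otherwise.
// Like the 32-bit original, it doesn't report the value found on failure.
static int CAS64(int64_t *ptr, int64_t oldVal, int64_t newVal)
{
    unsigned char ret;
    __asm__ __volatile__ (
        " lock\n"
        " cmpxchgq %[newval], %[mem]\n"
        " sete %0\n"
        : "=q" (ret), [mem] "+m" (*ptr), "+a" (oldVal)   // oldVal lives in RAX
        : [newval] "r" (newVal)
        : "memory");   // compiler barrier, as in the 32-bit version
    return ret;
}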
cmpxchg8b
__forceinline int64_t interlockedCompareExchange(volatile int64_t & v,int64_t exValue,int64_t cmpValue)
{
__asm {
mov esi,v
mov ebx,dword ptr exValue
mov ecx,dword ptr exValue + 4
mov eax,dword ptr cmpValue
mov edx,dword ptr cmpValue + 4
lock cmpxchg8b qword ptr [esi]
}
// The old value ends up in EDX:EAX, which MSVC uses as the int64_t return
// value, so the missing C return statement is intentional (this is 32-bit
// MSVC inline asm).
}
The x64 architecture supports a 64-bit compare-exchange using the good old cmpxchg instruction. Or you could also use the somewhat more complicated cmpxchg8b instruction (quoted from the "AMD64 Architecture Programmer's Manual Volume 1: Application Programming"):
The CMPXCHG instruction compares a value in the AL or rAX register with the first (destination) operand, and sets the arithmetic flags (ZF, OF, SF, AF, CF, PF) according to the result. If the compared values are equal, the source operand is loaded into the destination operand. If they are not equal, the first operand is loaded into the accumulator. CMPXCHG can be used to try to intercept a semaphore, i.e. test if its state is free, and if so, load a new value into the semaphore, making its state busy. The test and load are performed atomically, so that concurrent processes or threads which use the semaphore to access a shared object will not conflict.
The CMPXCHG8B instruction compares the 64-bit values in the EDX:EAX registers with a 64-bit memory location. If the values are equal, the zero flag (ZF) is set, and the ECX:EBX value is copied to the memory location. Otherwise, the ZF flag is cleared, and the memory value is copied to EDX:EAX.
The CMPXCHG16B instruction compares the 128-bit value in the RDX:RAX and RCX:RBX registers with a 128-bit memory location. If the values are equal, the zero flag (ZF) is set, and the RCX:RBX value is copied to the memory location. Otherwise, the ZF flag is cleared, and the memory value is copied to rDX:rAX.
Different assembler syntaxes may need to have the length of the operations specified in the instruction mnemonic if the size of the operands can't be inferred. This may be the case for GCC's inline assembler - I don't know.
Usage of cmpxchg8b, from the AMD64 Architecture Programmer's Manual Volume 3:
Compare EDX:EAX register to 64-bit memory location. If equal, set the zero flag (ZF) to 1 and copy the ECX:EBX register to the memory location. Otherwise,
copy the memory location to EDX:EAX and clear the zero flag.
I use cmpxchg8b to implement a simple mutex lock function on an x86-64 machine. Here is the code:
.text
.align 8
.global mutex_lock
mutex_lock:
pushq %rbp
movq %rsp, %rbp
jmp .L1
.L1:
movl $0, %edx          # expected value, high half: 0 = lock free
movl $0, %eax          # expected value, low half:  0 = lock free
movl $0, %ecx          # new value, high half
movl $1, %ebx          # new value, low half: 1 = lock taken
lock cmpxchg8B (%rdi)  # if the 64-bit value at (%rdi) equals edx:eax, store ecx:ebx there
jne .L1                # someone else holds the lock: spin and retry
popq %rbp
ret
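If inline asm isn't a hard requirement, the builtin mentioned in the editor's note at the top of this question covers the 64-bit case portably (and, with -mcx16, even 16-byte objects). A sketch; cas64 is just a wrapper name:
#include <stdint.h>
#include <stdbool.h>
// 64-bit CAS via the GCC/Clang builtin. On failure, *expected is updated
// with the value actually found, which the inline-asm versions above
// don't give you.
static bool cas64(int64_t *ptr, int64_t *expected, int64_t desired)
{
    return __atomic_compare_exchange_n(ptr, expected, desired,
                                       false /* strong */,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}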
