Predict cache line length using time measure

Hi, this is my first post here, so hello everyone.
I have a problem predicting the cache line size with GNU AS. I wrote a program in C which calls a function written in assembly.
Here is the function:
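.section .data
.section .text
.global time
time:
pushl %ebx # CPUID overwrites EBX, which is callee-saved
xor %edx, %edx
xor %eax, %eax # CPUID leaf 0
CPUID # serialize before reading the counter
RDTSC # timestamp is returned in EDX:EAX
popl %ebx
ret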
It measures CPU cycles.
The C code is:
#include <stdio.h>
const int size = 256;
void main(){
unsigned long long cykl, cykl1, cykl2;
unsigned char matrix[size];
char bla;
int i,j,k;
for(i=0 ; i<size; i++)
{
cykl1 = time();
bla = matrix[i];
cykl2 = time();
cykl = cykl2 - cykl1;
printf("i=%d: %lld \n",i, cykl);
}
}
I ran this program, but I can't see any time difference. As far as I know, my cache line length is 64 bytes.
The time should rise every time I load the next 64 bytes of the array, am I right?
I would be grateful for any advice on why it doesn't work properly.

I think there are 3 problems.
First, what your program calls might not be your assembly routine, but the time(2) system call, which returns the current time in seconds.
The assembly routine name should be prefixed with an underscore, i.e., _time, in your *.s file.
You can also declare the routine with __asm__ keyword. See:
http://gcc.gnu.org/onlinedocs/gcc-4.8.0/gcc/Asm-Labels.html
Second, the memory access might be eliminated (or reordered) by GCC optimizer.
You should check the assembly code generated from your C code.
Third, RDTSC is not a serializing instruction.
In other words, the CPU might reorder a memory access and an RDTSC instruction, or execute them in parallel.
You should insert some instructions to prevent the reordering.
See the description of RDTSC in "Intel® 64 and IA-32 Architecture Software Developer's Manual Volume 2B: Instruction Set Reference, M-Z":
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
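To illustrate the second and third points, here is a minimal sketch, assuming GCC or Clang on x86-64. The names read_ticks and time_load are made up for this example (and deliberately do not clash with time()); the clock_gettime branch is only a fallback so the file builds on other targets. The volatile pointer stops the optimizer from dropping the load, and the LFENCE on each side of RDTSC is one common way (on Intel CPUs) to keep the load from drifting across the timestamp reads.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__)
#include <x86intrin.h> /* __rdtsc and _mm_lfence on GCC/Clang */
#endif

/* Hypothetical helper: a timestamp read that is fenced on both sides. */
static inline uint64_t read_ticks(void)
{
#if defined(__x86_64__)
    _mm_lfence();           /* earlier loads must complete first */
    uint64_t t = __rdtsc(); /* counts reference cycles, not core cycles */
    _mm_lfence();           /* later loads may not start before this */
    return t;
#else
    struct timespec ts;     /* portable fallback, in nanoseconds */
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Time a single byte load; volatile forces the compiler to emit it. */
uint64_t time_load(const volatile unsigned char *p)
{
    assert(p != NULL);
    uint64_t t0 = read_ticks();
    unsigned char sink = *p;
    uint64_t t1 = read_ticks();
    (void)sink;
    return t1 - t0;
}
```

Calling time_load(&matrix[i]) for each i would replace the body of the loop in the question. Even then, a line that is already cached (or prefetched) will not show a jump at every 64th byte, so the numbers still need careful interpretation.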

Related

Why is my SSE assembly slower in release builds?

I've been playing around with some x64 assembly and the XMM registers to do some float math, and I'm seeing some performance that is puzzling me.
As a self-learning exercise, I wrote some SSE assembly to approximate the 'sin' function (using the Taylor series), and called this from some basic C++ in a loop to compare to the standard library version. Code is below, and I've pasted the output for some typical runs after that. (I'm not looking for a critique of the code or approach here, just trying to understand the perf numbers).
What I don't get is why a "Release" build, where the actual running assembly is identical (I've stepped through it in the debugger to double check), is consistently about 40 - 50 cycles slower. (Uncommenting the LFENCE instructions adds about 100 cycles to both Debug and Release, so the delta remains the same.) As a bonus question, why is the very first iteration typically in the thousands!!
I get that this stuff is very complex and subtly impacted by numerous factors, but everything that pops into my head as a potential cause here just doesn't make sense.
I've checked the MXCSR flags in both runs, and they are identical across builds as well (with the default value of 1f80h, which has all exceptions masked).
Any idea what would cause this? What further analysis could I do to figure this out at an even deeper level?
Assembly
_RDATA segment
pi real4 3.141592654
rf3 real4 0.1666666667
rf5 real4 0.008333333333
rf7 real4 0.0001984126984
_RDATA ends
_TEXT segment
; float CalcSin(float rads, int* cycles)
CalcSin PROC
; "leaf" function - doesn't use the stack or any non-volatile registers
mov r8, rdx ; Save the 'cycles' pointer into R8
rdtsc ; Get current CPU cycles in EDX:EAX
; lfence ; Ensure timer is taken before executing the below
mov ecx, eax ; Save the low 32 bits of the timer into ECX
movss xmm2, xmm0
mulss xmm2, xmm2 ; X^2
movss xmm3, xmm0
mulss xmm3, xmm2 ; x^3
movss xmm4, rf3 ; 1/3!
mulss xmm4, xmm3 ; x^3 / 3!
subss xmm0, xmm4 ; x - x^3 / 3!
mulss xmm3, xmm2 ; x^5
movss xmm4, rf5 ; 1/5!
mulss xmm4, xmm3 ; x^5 / 5!
addss xmm0, xmm4 ; x - x^3 / 3! + x^5 / 5!
mulss xmm3, xmm2 ; x^7
movss xmm4, rf7 ; 1/7!
mulss xmm4, xmm3 ; x^7 / 7!
subss xmm0, xmm4 ; x - x^3 / 3! + x^5 / 5! - x^7 / 7!
; lfence ; Ensure above completes before taking the timer again
rdtsc ; Get the timer now
sub eax, ecx ; Get the difference in cycles
mov dword ptr [r8], eax
ret
CalcSin ENDP
_TEXT ends
END
C++
#include <stdio.h>
#include <math.h>
#include <vector>
const float PI = 3.141592654f;
extern "C" float CalcSin(float rads, int* cycles);
void DoCalcs(float rads) {
int cycles;
float result = CalcSin(rads, &cycles);
printf("Sin(%.8f) = %.8f. Took %d cycles\n", rads, result, cycles);
printf("C library = %.8f\n", sin(rads));
}
int main(int argc, char* argv[]) {
std::vector<float> inputs{PI / 1000, PI / 2 - PI / 1000, PI / 4, 0.0001f, PI / 2};
for (auto val : inputs) {
DoCalcs(val);
}
return 0;
}
With a "Debug" build (I'm using Visual Studio 2019), I typically see the below timing reported:
Sin(0.00314159) = 0.00314159. Took 3816 cycles
C library = 0.00314159
Sin(1.56765473) = 0.99984086. Took 18 cycles
C library = 0.99999507
Sin(0.78539819) = 0.70710647. Took 18 cycles
C library = 0.70710680
Sin(0.00010000) = 0.00010000. Took 18 cycles
C library = 0.00010000
Sin(1.57079637) = 0.99984306. Took 18 cycles
C library = 1.00000000
The exact same code with a "Release" build, I typically see the below:
Sin(0.00314159) = 0.00314159. Took 4426 cycles
C library = 0.00314159
Sin(1.56765473) = 0.99984086. Took 70 cycles
C library = 0.99999507
Sin(0.78539819) = 0.70710647. Took 62 cycles
C library = 0.70710680
Sin(0.00010000) = 0.00010000. Took 64 cycles
C library = 0.00010000
Sin(1.57079637) = 0.99984306. Took 62 cycles
C library = 1.00000000
====UPDATE 1====
I changed the code to load the constants as immediates, instead of referencing the .rdata segment as Peter mentioned, and this got rid of the slow first iteration, i.e. replaced the commented out line with the 2 lines following:
; movss xmm4, rf5 ; 1/5!
mov eax, 3C088889h ; 1/5! float representation
movd xmm4, eax
Warming up the CPU didn't help, but I did notice the first iteration in Release was now just as fast as in Debug, and the rest were still slow. As the printf isn't called until after the first calculation, I wondered if this had an impact. I changed the code to just store the results as it ran and print them once complete, and now Release is just as fast, i.e.
Updated C++ code
extern "C" float CalcSin(float rads, int* cycles);
std::vector<float> values;
std::vector<int> rdtsc;
void DoCalcs(float rads) {
int cycles;
float result = CalcSin(rads, &cycles);
values.push_back(result);
rdtsc.push_back(cycles);
// printf("Sin(%.8f) = %.8f. Took %d cycles\n", rads, result, cycles);
// printf("C library = %.8f\n", sin(rads));
}
int main(int argc, char* argv[]) {
std::vector<float> inputs{PI / 1000, PI / 2 - PI / 1000, PI / 4, 0.0001f, PI / 2};
for (auto val : inputs) {
DoCalcs(val);
}
auto cycle_iter = rdtsc.begin();
auto value_iter = values.begin();
for (auto& input : inputs) {
printf("Sin(%.8f) = %.8f. Took %d cycles\n", input, *value_iter++, *cycle_iter++);
printf("C library = %.8f\n", sin(input));
}
return 0;
}
And now Release is pretty much identical to debug, i.e. around 18 - 24 cycles consistently on each call.
I'm not sure what the printf call is doing in Release builds, or whether it is the way it was linked/optimized with Release settings, but it's strange that it slowed down the identical, separate assembly calls the way it did.
Sin(0.00314159) = 0.00314159. Took 18 cycles
C library = 0.00314159
Sin(1.56765473) = 0.99984086. Took 18 cycles
C library = 0.99999507
Sin(0.78539819) = 0.70710647. Took 24 cycles
C library = 0.70710680
Sin(0.00010000) = 0.00010000. Took 20 cycles
C library = 0.00010000
Sin(1.57079637) = 0.99984306. Took 24 cycles
C library = 1.00000000
====UPDATE 2====
To rule out the CPU clock ramping up or down, I went in and tweaked a few BIOS settings (disabled Turbo, set a consistent core voltage, etc.), and can now see via the "AI Suite" ASUS app for the motherboard that the CPU is at a consistent 3600MHz. (I'm running an Intel Core i9-9900K @ 3.6GHz on Windows 10 x64.)
After setting that... still no change.
The next thing that occurred to me is that with the 'printf' I have a call out to the C runtime library between each loop iteration, and that library is a different DLL in Debug and Release builds. To remove any other variation I started building from the command line instead of VS. Compiling with maximum speed optimizations and the release CRT DLLs (/O2 and /MD respectively), I still see the same slow-down. Switching to the debug CRT DLLs, I see some improvement. If I switch to statically linking the CRT, then it doesn't matter if I use the debug or release versions, or if I compile with optimizations or not: I regularly see 24 cycles per call, i.e.
ml64 /c ..\x64simd.asm
cl.exe /Od /MT /Feapp.exe ..\main.cpp x64simd.obj
>app.exe
Sin(0.00314159) = 0.00314159. Took 24 cycles
Sin(1.56765473) = 0.99984086. Took 24 cycles
Sin(0.78539819) = 0.70710647. Took 24 cycles
Sin(0.00010000) = 0.00010000. Took 24 cycles
Sin(1.57079637) = 0.99984306. Took 24 cycles
So it's definitely something in calling out to the CRT Release DLLs causing the slow-down. I'm still puzzled as to why, especially as the Debug build in VS is also using CRT via DLLs.

You're timing in reference cycles with rdtsc, not core clock cycles. It's probably the same speed both times, in core clock cycles, but with the CPU running at different frequencies.
Probably a debug build gives the CPU time to ramp up to max turbo (more core cycles per reference cycle) before your function gets called, because the calling code compiles to slower asm. Especially with MSVC, a debug build adds extra stuff like poisoning the stack frame to catch use of uninitialized vars, plus overhead for incremental linking.
None of this slows down your hand-written function itself, it's just "warm up" that you neglected to do manually in your microbenchmark.
See How to get the CPU cycle count in x86_64 from C++? for lots more details about RDTSC.
A factor of ~3 between idle CPU clock and max turbo (or some higher clock) is very plausible for modern x86 CPUs. My i7-6700k idles at 0.8GHz with a rated frequency of 4.0GHz and a max single-core turbo of 4.2GHz. But many laptop CPUs have a much lower non-turbo max (and might only ramp to non-turbo initially, not max turbo right away, depending on the energy_performance_preference HW governor, or especially the software governor on older CPUs).
As a bonus question, why is the very first iteration typically in the thousands!!
Probably dTLB miss and cache miss for loading rf3 from data memory. You could try loading those from C (by declaring extern volatile float rf3) to prime the TLB + cache for that block of constants, assuming they're all in the same cache line.
Possibly also an I-cache miss after the rdtsc, but the first load is probably before the end of an I-cache line so those could happen in parallel. (Putting the rdtsc inside your asm function means we probably aren't waiting for an iTLB miss or i-cache miss inside the timed region to even fetch the first byte of the function).
Code review:
Don't use movss between XMM registers unless you want to blend the low 4 bytes into the old value of the destination. Use movaps xmm2, xmm0 to copy the whole register; it's much more efficient.
movaps can be handled by register renaming without needing any back-end execution unit, vs. movss only running on one execution unit in Intel CPUs, port 5. https://agner.org/optimize/. Also, movaps avoids a false dependency on the old value of the register because it overwrites the full reg, allowing out-of-order exec to work properly.
movss xmm, [mem] is fine, though: as a load it zero-extends into the full register.
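The "warm up" advice above can be sketched as a small harness. This is only an illustration under stated assumptions: best_of, now_ns and sample_work are hypothetical names, it times with clock_gettime (wall-clock nanoseconds, POSIX) rather than RDTSC, and sample_work is a stand-in for the real CalcSin call. The idea is to spin for a while so the clock can ramp up and the caches and TLBs are primed, then report the fastest of many runs instead of a single one.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

static volatile float sink;

/* Stand-in workload: the same 7th-order Taylor polynomial as CalcSin.
 * An optimizing compiler may fold it to a constant store; it only
 * marks the place where the real call would go. */
void sample_work(void)
{
    float x = 0.5f, x2 = x * x, x3 = x * x2;
    sink = x - x3 / 6.0f + x3 * x2 / 120.0f - x3 * x2 * x2 / 5040.0f;
}

/* Run fn for warmup_ns first, then return the fastest of reps runs. */
uint64_t best_of(void (*fn)(void), uint64_t warmup_ns, int reps)
{
    assert(fn != NULL && reps > 0);
    uint64_t start = now_ns();
    while (now_ns() - start < warmup_ns)
        fn(); /* warm-up: lets the clock ramp, primes caches and TLBs */

    uint64_t best = UINT64_MAX;
    for (int i = 0; i < reps; i++) {
        uint64_t t0 = now_ns();
        fn();
        uint64_t dt = now_ns() - t0;
        if (dt < best)
            best = dt;
    }
    return best;
}
```

Something like best_of(sample_work, 100000000u, 1000) warms up for about 100 ms. Taking the minimum filters out one-off costs such as the first-call page and cache misses described above.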

Why does re-initializing a register inside an unrolled ADD loop make it run faster even with more instructions inside the loop?

I have the following code:
#include <iostream>
#include <chrono>
#define ITERATIONS "10000"
int main()
{
/*
======================================
The first case: the MOV is outside the loop.
======================================
*/
auto t1 = std::chrono::high_resolution_clock::now();
asm("mov $100, %eax\n"
"mov $200, %ebx\n"
"mov $" ITERATIONS ", %ecx\n"
"lp_test_time1:\n"
" add %eax, %ebx\n" // 1
" add %eax, %ebx\n" // 2
" add %eax, %ebx\n" // 3
" add %eax, %ebx\n" // 4
" add %eax, %ebx\n" // 5
"loop lp_test_time1\n");
auto t2 = std::chrono::high_resolution_clock::now();
auto time = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
std::cout << time;
/*
======================================
The second case: the MOV is inside the loop (faster).
======================================
*/
t1 = std::chrono::high_resolution_clock::now();
asm("mov $100, %eax\n"
"mov $" ITERATIONS ", %ecx\n"
"lp_test_time2:\n"
" mov $200, %ebx\n"
" add %eax, %ebx\n" // 1
" add %eax, %ebx\n" // 2
" add %eax, %ebx\n" // 3
" add %eax, %ebx\n" // 4
" add %eax, %ebx\n" // 5
"loop lp_test_time2\n");
t2 = std::chrono::high_resolution_clock::now();
time = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
std::cout << '\n' << time << '\n';
}
The first case
I compiled it with
gcc version 9.2.0 (GCC)
Target: x86_64-pc-linux-gnu
gcc -Wall -Wextra -pedantic -O0 -o proc proc.cpp
and its output is
14474
5837
I also compiled it with Clang with the same result.
So, why is the second case faster (almost a 3x speedup)? Is it actually related to some microarchitectural details? If it matters, I have an AMD CPU: “AMD A9-9410 RADEON R5, 5 COMPUTE CORES 2C+3G”.

mov $200, %ebx inside the loop breaks the loop-carried dependency chain through ebx, allowing out-of-order execution to overlap the chain of 5 add instructions across multiple iterations.
Without it, the chain of add instructions bottlenecks the loop on the latency of the add (1 cycle) critical path, instead of the throughput (4/cycle on Excavator, improved from 2/cycle on Steamroller). Your CPU is an Excavator core.
AMD since Bulldozer has an efficient loop instruction (only 1 uop), unlike Intel CPUs where loop would bottleneck either loop at 1 iteration per 7 cycles. (https://agner.org/optimize/ for instruction tables, microarch guide, and more details on everything in this answer.)
With loop and mov taking slots in the front-end (and back-end execution units) away from add, a 3x instead of 4x speedup looks about right.
See this answer for an intro to how CPUs find and exploit Instruction Level Parallelism (ILP).
See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for some in-depth details about overlapping independent dep chains.
BTW, 10k iterations is not many. Your CPU might not even ramp up out of idle speed in that time. Or might jump to max speed for most of the 2nd loop but none of the first. So be careful with microbenchmarks like this.
Also, your inline asm is unsafe because you forgot to declare clobbers on EAX, EBX, and ECX. You step on the compiler's registers without telling it. Normally you should always compile with optimization enabled, but your code would probably break if you did that.
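For the clobber problem, one possible fix is to stop naming registers at all and pass everything as operands, so the compiler picks the registers and knows they change. The sketch below is an assumption-laden illustration, not the asker's code: add_chain is a hypothetical name, the loop instruction is swapped for dec/jnz (so the counter no longer has to live in ECX), and the non-x86 branch exists only so the file builds elsewhere.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the first loop from the question, with every
 * register declared as an operand instead of silently overwritten. */
uint32_t add_chain(uint32_t iters)
{
    assert(iters > 0); /* the loop body always runs at least once */
    uint32_t acc = 200;
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile(
        "1:\n\t"
        "add %[inc], %[acc]\n\t" /* 1 */
        "add %[inc], %[acc]\n\t" /* 2 */
        "add %[inc], %[acc]\n\t" /* 3 */
        "add %[inc], %[acc]\n\t" /* 4 */
        "add %[inc], %[acc]\n\t" /* 5 */
        "dec %[cnt]\n\t"
        "jnz 1b\n\t"
        : [acc] "+&r"(acc), [cnt] "+&r"(iters) /* read and written */
        : [inc] "r"(100u)                      /* read only */
        : "cc");                               /* flags are modified */
#else
    acc += 500u * iters; /* same arithmetic without inline asm */
#endif
    return acc;
}
```

Written this way the code stays correct with optimization enabled. The early-clobber marker (&) keeps the compiler from placing the read-only increment in the same register as a value the loop modifies.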

Convert Pentium II timing code into inline assembly?

I am trying to use the following code in GCC. It is throwing errors (I guess because of __asm). Why doesn't this simple format work in GCC? The syntax of extended assembly is provided here. I get confused when it comes to using more variables in inline assembly. Can someone convert the following program to the appropriate form and explain wherever variables are used?
int time, subtime;
float x = 5.0f;
__asm {
cpuid
rdtsc
mov subtime, eax
cpuid
rdtsc
sub eax, subtime
mov subtime, eax // Only the last value of subtime is kept
// subtime should now represent the overhead cost of the
// MOV and CPUID instructions
fld x
fld x
cpuid // Serialize execution
rdtsc // Read time stamp to EAX
mov time, eax
fdiv // Perform division
cpuid // Serialize again for time-stamp read
rdtsc
sub eax, time // Find the difference
mov time, eax
}

Your question is effectively a code conversion question, which is generally off-topic for Stack Overflow. An answer, however, may be beneficial to other readers.
This code is a conversion of the original source material, and is not meant as an enhancement. The actual FDIV/FDIVP and the FLD can be reduced to a single FLD and a FDIV/FDIVP since you are dividing a float value by itself. As Peter Cordes points out though, you can just load the top of stack with a value 1.0 with FLD1. This would work since dividing any number by itself (besides 0.0) will take the same time as dividing 5.0 by itself. This would remove the need for passing the variable x into the assembler template.
The code you are using is a variation of what was documented by Intel 20 years ago for the Pentium IIs. A discussion of what is going on for that processor is described. The variation is that the code you are using doesn't do the warm up described in that document. I do not believe this mechanism will work overly well on modern processors and OSes (be warned).
The code in question is intended to measure time it takes for a single FDIV instruction to complete. Assuming you actually want to convert this specific code you will have to use GCC extended assembler templates. Extended assembler templates are not easy to use for a first time GCC developer. For assembler code you might even consider putting the code into a separate assembly file, assemble it separately, and call it from C.
Assembler templates use input constraints and output constraints to pass data into and out of the template (unlike MSVC). They also use a clobber list to specify registers that may have been altered but don't appear as an input or output. By default GCC inline assembly uses AT&T syntax instead of Intel syntax.
The equivalent code using extended assembler with ATT syntax could look like this:
#include <stdio.h>
int main()
{
int time, subtime;
float x = 5.0f;
int temptime;
__asm__ (
"rdtsc\n\t"
"mov %%eax, %[subtime]\n\t"
"cpuid\n\t"
"rdtsc\n\t"
"sub %[subtime], %%eax\n\t"
"mov %%eax, %[subtime]\n\t"
/* Only the last value of subtime is kept
* subtime should now represent the overhead cost of the
* MOV and CPUID instructions */
"flds %[x]\n\t"
"flds %[x]\n\t" /* Alternatively use fst to make copy */
"cpuid\n\t" /* Serialize execution */
"rdtsc\n\t" /* Read time stamp to EAX */
"mov %%eax, %[temptime]\n\t"
"fdivp\n\t" /* Perform division */
"cpuid\n\t" /* Serialize again for time-stamp read */
"rdtsc\n\t"
"sub %[temptime], %%eax\n\t"
"fstp %%st(0)\n\t" /* Need to clear FPU stack before returning */
: [time]"=a"(time), /* 'time' is returned via the EAX register */
[subtime]"=r"(subtime), /* return reg for subtime */
[temptime]"=r"(temptime) /* Temporary reg for computation
This allows compiler to choose
a register for temporary use. Register
only for BOTH so subtime and temptime
calc are based on a mov reg, reg */
: [x]"m"(x) /* X is a MEMORY reference (required by FLD) */
: "ebx", "ecx", "edx"); /* Registers clobbered by CPUID
but not listed as input/output
operands */
time = time - subtime; /* Subtract the overhead */
printf ("%d\n", time); /* Print total time of divide to screen */
return 0;
}
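The three parts of an extended asm statement (outputs, inputs, clobbers) may be easier to see in a smaller piece. The following is a rough sketch assuming GCC or Clang on x86-64; serialized_tsc and timer_overhead are hypothetical names, and the clock_gettime branch is only a fallback for other targets. It reads the full 64-bit counter through two outputs and measures the cost of back-to-back reads, which is what subtime is meant to capture above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: CPUID-serialized timestamp read. */
static inline uint64_t serialized_tsc(void)
{
#if defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ volatile(
        "xorl %%eax, %%eax\n\t" /* CPUID leaf 0 */
        "cpuid\n\t"             /* serialize execution */
        "rdtsc\n\t"             /* timestamp in EDX:EAX */
        : "=a"(lo), "=d"(hi)    /* outputs: EAX and EDX */
        :                       /* no inputs */
        : "rbx", "rcx", "memory"); /* CPUID also writes EBX and ECX */
    return ((uint64_t)hi << 32) | lo;
#else
    struct timespec ts;         /* portable fallback, in nanoseconds */
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Smallest gap seen between two back-to-back timestamp reads. */
uint64_t timer_overhead(int reps)
{
    assert(reps > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < reps; i++) {
        uint64_t t0 = serialized_tsc();
        uint64_t t1 = serialized_tsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Keeping each asm statement this small lets the compiler handle the subtraction and the temporaries in plain C, so fewer constraints can go wrong.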

GCC, ICC and Visual C all have very different syntax for inline assembler (it is not part of the C standard). GCC's is a bit more complex, but also more efficient, since you tell the compiler which registers are used for what, and which registers are clobbered (overwritten).
https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
https://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
http://asm.sourceforge.net/articles/rmiyagi-inline-asm.txt
My GCC inline assembler is a bit rusty (it has been a couple of years since I played with it), so there might be some mistakes here:
int main(int argc, char *argv[])
{
int time=0, subtime = 100;
const float x = 5.0f;
asm (
"xorl %%eax, %%eax \n" /* make sure eax is a known value before cpuid */
"cpuid \n"
"rdtsc \n"
"movl %%eax, %[aSubtime] \n"
"cpuid \n"
"rdtsc \n"
"subl %[aSubtime], %%eax \n"
// subtime should now represent the overhead cost of the
// MOV and CPUID instructions
"fld %[ax] \n"
"fld %[ax] \n"
"cpuid \n" // Serialize execution
"rdtsc \n" // Read time stamp to EAX
"movl %%eax, %[atime] \n"
"fdivp \n" // Perform division
"cpuid \n" // Serialize again for time-stamp read
"rdtsc \n"
"subl %[atime], %%eax \n"
// "movl %%eax, %2 \n" Not needed, since we tell the compiler that asm exits with time in eax
: "=a" (time) /* time is output in eax */
: [aSubtime] "m" (subtime),
[ax] "m" (x),
[atime] "m" (time)
: "ebx", "ecx", "edx"
);
/* FPU is currently left in a pushed state here */
return 0;
}

gcc inline asm without instruction

Is it "legal" to have a gcc inline asm statement without the actual instruction?
For example, is the asm statement "legal"? Will it introduce undefined behaviour?
int main(){
int *p = something;
asm("":"=m"(p));
return 0;
}

int main(){
int *p = 0;
asm("":"=m"(p));
return 0;
}
Compiles without any errors, but it is unnecessary:
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movq $0, -8(%rbp)
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
GCC completely ignores that empty asm statement.

In that specific example, the compiler can see that none of the outputs of the asm (ie p) are ever used. Since the asm is not volatile (and has at least 1 output), the compiler is free to completely discard the statement during optimization.
It may also be worth mentioning that on i386, all extended asm statements (ie ones with parameters) always implicitly clobber fpsr (floating point flags) and eflags (think: cc clobber). If the statement isn't completely discarded (for example if it is volatile), this might have an effect. If it does, it will at worst be a tiny loss of efficiency, not incorrect results.
So to sum up:
Yes it is legal.
No it doesn't introduce undefined behavior EXCEPT that the value of p could be undefined since you are saying that you are overwriting the contents, but you aren't actually putting anything in it.
It can, theoretically, introduce tiny inefficiencies, but it probably won't.
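An empty template is not always pointless, though. A common idiom (sketched here with GCC/Clang syntax; opaque and sum_to are hypothetical names) is a volatile asm with an in/out operand and no instructions. It emits nothing, but the compiler has to assume the value changed, so it cannot constant-fold through it. Because the operand is "+r" rather than "=m", the variable is read as well as written, which avoids the undefined-value problem mentioned above.

```c
#include <assert.h>

/* Hypothetical helper: returns v unchanged, but hides that fact from
 * the optimizer. The asm contains no instructions at all. */
static inline int opaque(int v)
{
    __asm__ volatile("" : "+r"(v));
    return v;
}

/* Sum 1..n; the optimizer cannot replace the loop with a closed form
 * because every term passes through opaque(). */
int sum_to(int n)
{
    assert(n >= 0);
    int total = 0;
    for (int i = 1; i <= n; i++)
        total += opaque(i);
    return total;
}
```

This is mostly useful in microbenchmarks, where the work being timed must not be optimized away.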

cmpxchg example for 64 bit integer

I am using cmpxchg (compare-and-exchange) in i686 architecture for 32 bit compare and swap as follows.
(Editor's note: the original 32-bit example was buggy, but the question isn't about it. I believe this version is safe, and as a bonus compiles correctly for x86-64 as well. Also note that inline asm isn't needed or recommended for this; __atomic_compare_exchange_n or the older __sync_bool_compare_and_swap work for int32_t or int64_t on i486 and x86-64. But this question is about doing it with inline asm, in case you still want to.)
// note that this function doesn't return the updated oldVal
static int CAS(int *ptr, int oldVal, int newVal)
{
unsigned char ret;
__asm__ __volatile__ (
" lock\n"
" cmpxchgl %[newval], %[mem]\n"
" sete %0\n"
: "=q" (ret), [mem] "+m" (*ptr), "+a" (oldVal)
: [newval]"r" (newVal)
: "memory"); // barrier for compiler reordering around this
return ret; // ZF result, 1 on success else 0
}
What is the equivalent for the x86_64 architecture for a 64-bit compare and swap?
static int CAS(long *ptr, long oldVal, long newVal)
{
unsigned char ret;
// ?
return ret;
}

The x86_64 instruction set has the cmpxchgq (q for quadword) instruction for 8-byte (64 bit) compare and swap.
There's also a cmpxchg8b instruction which will work on 8-byte quantities but it's more complex to set up, needing you to use edx:eax and ecx:ebx rather than the more natural 64-bit rax. The reason this exists almost certainly has to do with the fact Intel needed 64-bit compare-and-swap operations long before x86_64 came along. It still exists in 64-bit mode, but is no longer the only option.
But, as stated, cmpxchgq is probably the better option for 64-bit code.
If you need to cmpxchg a 16 byte object, the 64-bit version of cmpxchg8b is cmpxchg16b. It was missing from the very earliest AMD64 CPUs, so compilers won't generate it for std::atomic::compare_exchange on 16B objects unless you enable -mcx16 (for gcc). Assemblers will assemble it, though, but beware that your binary won't run on the earliest K8 CPUs. (This only applies to cmpxchg16b, not to cmpxchg8b in 64-bit mode, or to cmpxchgq).
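As a sketch of what the cmpxchgq version could look like, here is the 32-bit function from the question widened to 64 bits. It assumes GCC or Clang; the name cas64 is made up for this example, and on targets other than x86-64 it falls back to the __atomic_compare_exchange_n builtin mentioned in the editor's note (which is the recommended route anyway).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if *ptr equaled oldVal and was replaced by newVal, else 0.
 * Like the 32-bit version, it does not report the value it found. */
static int cas64(int64_t *ptr, int64_t oldVal, int64_t newVal)
{
    assert(ptr != NULL);
#if defined(__x86_64__)
    unsigned char ret;
    __asm__ __volatile__(
        "lock cmpxchgq %[newval], %[mem]\n\t" /* compares RAX with *ptr */
        "sete %[ret]\n\t"                     /* ZF set on success */
        : [ret] "=q"(ret), [mem] "+m"(*ptr), "+a"(oldVal)
        : [newval] "r"(newVal)
        : "memory", "cc"); /* compiler barrier; flags are modified */
    return ret;
#else
    return __atomic_compare_exchange_n(ptr, &oldVal, newVal, 0,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
#endif
}
```

The only changes from the 32-bit version are the q suffix and the 64-bit operand types; the "a" constraint then selects RAX instead of EAX.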

cmpxchg8b
__forceinline int64_t interlockedCompareExchange(volatile int64_t & v,int64_t exValue,int64_t cmpValue)
{
__asm {
mov esi,v
mov ebx,dword ptr exValue
mov ecx,dword ptr exValue + 4
mov eax,dword ptr cmpValue
mov edx,dword ptr cmpValue + 4
lock cmpxchg8b qword ptr [esi]
}
}

The x64 architecture supports a 64-bit compare-exchange using the good, old cmpxchg instruction. Or you could also use the somewhat more complicated cmpxchg8b instruction (from the "AMD64 Architecture Programmer's Manual Volume 1: Application Programming"):
The CMPXCHG instruction compares a value in the AL or rAX register with the first (destination) operand, and sets the arithmetic flags (ZF, OF, SF, AF, CF, PF) according to the result. If the compared values are equal, the source operand is loaded into the destination operand. If they are not equal, the first operand is loaded into the accumulator. CMPXCHG can be used to try to intercept a semaphore, i.e. test if its state is free, and if so, load a new value into the semaphore, making its state busy. The test and load are performed atomically, so that concurrent processes or threads which use the semaphore to access a shared object will not conflict.
The CMPXCHG8B instruction compares the 64-bit values in the EDX:EAX registers with a 64-bit memory location. If the values are equal, the zero flag (ZF) is set, and the ECX:EBX value is copied to the memory location. Otherwise, the ZF flag is cleared, and the memory value is copied to EDX:EAX.
The CMPXCHG16B instruction compares the 128-bit value in the RDX:RAX and RCX:RBX registers with a 128-bit memory location. If the values are equal, the zero flag (ZF) is set, and the RCX:RBX value is copied to the memory location. Otherwise, the ZF flag is cleared, and the memory value is copied to rDX:rAX.
Different assembler syntaxes may need to have the length of the operations specified in the instruction mnemonic if the size of the operands can't be inferred. This may be the case for GCC's inline assembler - I don't know.

Usage of cmpxchg8b, from the AMD64 Architecture Programmer's Manual V3:
Compare EDX:EAX register to 64-bit memory location. If equal, set the zero flag (ZF) to 1 and copy the ECX:EBX register to the memory location. Otherwise, copy the memory location to EDX:EAX and clear the zero flag.
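I use cmpxchg8b to implement a simple mutex lock function on an x86-64 machine. Here is the code:
.text
.align 8
.global mutex_lock
mutex_lock:
pushq %rbp
movq %rsp, %rbp
pushq %rbx # RBX is callee-saved and is overwritten below
.L1:
movl $0, %edx # expected value EDX:EAX = 0 (unlocked)
movl $0, %eax
movl $0, %ecx # new value ECX:EBX = 1 (locked)
movl $1, %ebx
lock cmpxchg8b (%rdi)
jne .L1 # ZF clear: the lock was held, so retry
popq %rbx
popq %rbp
ret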
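For comparison, the same spin-until-acquired idea can be written with C11 atomics, which also makes the matching unlock easy to show. This is a sketch under assumptions, not a replacement for a real mutex: spin_mutex, spin_lock, spin_trylock and spin_unlock are hypothetical names, and like the assembly above it busy-waits instead of sleeping.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lock type: 0 means unlocked, 1 means locked. */
typedef struct {
    _Atomic int64_t state;
} spin_mutex;

/* One attempt: returns 1 if the lock was taken, 0 if it was held. */
int spin_trylock(spin_mutex *m)
{
    assert(m != NULL);
    int64_t expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &m->state, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

/* Retry the compare-exchange until it succeeds. */
void spin_lock(spin_mutex *m)
{
    assert(m != NULL);
    while (!spin_trylock(m))
        ; /* busy-wait */
}

/* Release the lock with a plain atomic store. */
void spin_unlock(spin_mutex *m)
{
    assert(m != NULL);
    atomic_store_explicit(&m->state, 0, memory_order_release);
}
```

On x86-64 a compiler will typically turn the compare-exchange into a lock cmpxchg on a 64-bit register, so cmpxchg8b is not needed for an 8-byte lock word.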
