cmpxchg example for 64 bit integer - gcc

I am using cmpxchg (compare-and-exchange) in i686 architecture for 32 bit compare and swap as follows.
(Editor's note: the original 32-bit example was buggy, but the question isn't about it. I believe this version is safe, and as a bonus compiles correctly for x86-64 as well. Also note that inline asm isn't needed or recommended for this; __atomic_compare_exchange_n or the older __sync_bool_compare_and_swap work for int32_t or int64_t on i486 and x86-64. But this question is about doing it with inline asm, in case you still want to.)
// note that this function doesn't return the updated oldVal
static int CAS(int *ptr, int oldVal, int newVal)
{
unsigned char ret;
__asm__ __volatile__ (
" lock\n"
" cmpxchgl %[newval], %[mem]\n"
" sete %0\n"
: "=q" (ret), [mem] "+m" (*ptr), "+a" (oldVal)
: [newval]"r" (newVal)
: "memory"); // barrier for compiler reordering around this
return ret; // ZF result, 1 on success else 0
}
What is the equivalent for x86_64 architecture for 64 bit compare and swap
static int CAS(long *ptr, long oldVal, long newVal)
{
unsigned char ret;
// ?
return ret;
}

The x86_64 instruction set has the cmpxchgq (q for quadword) instruction for 8-byte (64 bit) compare and swap.
There's also a cmpxchg8b instruction which will work on 8-byte quantities but it's more complex to set up, needing you to use edx:eax and ecx:ebx rather than the more natural 64-bit rax. The reason this exists almost certainly has to do with the fact Intel needed 64-bit compare-and-swap operations long before x86_64 came along. It still exists in 64-bit mode, but is no longer the only option.
But, as stated, cmpxchgq is probably the better option for 64-bit code.
If you need to cmpxchg a 16 byte object, the 64-bit version of cmpxchg8b is cmpxchg16b. It was missing from the very earliest AMD64 CPUs, so compilers won't generate it for std::atomic::compare_exchange on 16B objects unless you enable -mcx16 (for gcc). Assemblers will assemble it, though, but beware that your binary won't run on the earliest K8 CPUs. (This only applies to cmpxchg16b, not to cmpxchg8b in 64-bit mode, or to cmpxchgq).

cmpxchg8b
__forceinline int64_t interlockedCompareExchange(volatile int64_t & v,int64_t exValue,int64_t cmpValue)
{
__asm {
mov esi,v
mov ebx,dword ptr exValue
mov ecx,dword ptr exValue + 4
mov eax,dword ptr cmpValue
mov edx,dword ptr cmpValue + 4
lock cmpxchg8b qword ptr [esi]
}
}

The x64 architecture supports a 64-bit compare-exchange using the good, old cmpexch instruction. Or you could also use the somewhat more complicated cmpexch8b instruction (from the "AMD64 Architecture Programmer's Manual Volume 1: Application Programming"):
The CMPXCHG instruction compares a
value in the AL or rAX register with
the first (destination) operand, and
sets the arithmetic flags (ZF, OF, SF,
AF, CF, PF) according to the result.
If the compared values are equal, the
source operand is loaded into the
destination operand. If they are not
equal, the first operand is loaded
into the accumulator. CMPXCHG can be
used to try to intercept a semaphore,
i.e. test if its state is free, and if
so, load a new value into the
semaphore, making its state busy. The
test and load are performed
atomically, so that concurrent
processes or threads which use the
semaphore to access a shared object
will not conflict.
The CMPXCHG8B
instruction compares the 64-bit values
in the EDX:EAX registers with a 64-bit
memory location. If the values are
equal, the zero flag (ZF) is set, and
the ECX:EBX value is copied to the
memory location. Otherwise, the ZF
flag is cleared, and the memory value
is copied to EDX:EAX.
The CMPXCHG16B
instruction compares the 128-bit value
in the RDX:RAX and RCX:RBX registers
with a 128-bit memory location. If the
values are equal, the zero flag (ZF)
is set, and the RCX:RBX value is
copied to the memory location.
Otherwise, the ZF flag is cleared, and
the memory value is copied to rDX:rAX.
Different assembler syntaxes may need to have the length of the operations specified in the instruction mnemonic if the size of the operands can't be inferred. This may be the case for GCC's inline assembler - I don't know.

usage of cmpxchg8B from AMD64 Architecture Programmer's Manual V3:
Compare EDX:EAX register to 64-bit memory location. If equal, set the zero flag (ZF) to 1 and copy the ECX:EBX register to the memory location. Otherwise,
copy the memory location to EDX:EAX and clear the zero flag.
I use cmpxchg8B to implement a simple mutex lock function in x86-64 machine. here is the code
.text
.align 8
.global mutex_lock
mutex_lock:
pushq %rbp
movq %rsp, %rbp
jmp .L1
.L1:
movl $0, %edx
movl $0, %eax
movl $0, %ecx
movl $1, %ebx
lock cmpxchg8B (%rdi)
jne .L1
popq %rbp
ret

Related

Is movzbl followed by testl faster than testb?

Consider this C code:
int f(void) {
int ret;
char carry;
__asm__(
"nop # do something that sets eax and CF"
: "=a"(ret), "=#ccc"(carry)
);
return carry ? -ret : ret;
}
When I compile it with gcc -O3, I get this:
f:
nop # do something that sets eax and CF
setc %cl
movl %eax, %edx
negl %edx
testb %cl, %cl
cmovne %edx, %eax
ret
If I change char carry to int carry, I instead get this:
f:
nop # do something that sets eax and CF
setc %cl
movl %eax, %edx
movzbl %cl, %ecx
negl %edx
testl %ecx, %ecx
cmovne %edx, %eax
ret
That change replaced testb %cl, %cl with movzbl %cl, %ecx and testl %ecx, %ecx. The program is actually equivalent, though, and GCC knows it. As evidence of this, if I compile with -Os instead of -O3, then both char carry and int carry result in the exact same assembly:
f:
nop # do something that sets eax and CF
jnc .L1
negl %eax
.L1:
ret
It seems like one of two things must be true, but I'm not sure which:
A testb is faster than a movzbl followed by a testl, so GCC's use of the latter with int is a missed optimization.
A testb is slower than a movzbl followed by a testl, so GCC's use of the former with char is a missed optimization.
My gut tells me that an extra instruction will be slower, but I also have a nagging doubt that it's preventing a partial register stall that I just don't see.
By the way, the usual recommended approach of xoring the register to zero before the setc doesn't work in my real example. You can't do it after the inline assembly runs, since xor will overwrite the carry flag, and you can't do it before the inline assembly runs, since in the real context of this code, every general-purpose call-clobbered register is already in use somehow.
There's no downside I'm aware of to reading a byte register with test vs. movzb.
If you are going to zero-extend, it's also a missed optimization not to xor-zero a reg ahead of the asm statement, and setc into that so the cost of zero-extension is off the critical path. (On CPUs other than Intel IvyBridge+ where movzx r32, r8 is not zero latency). Assuming there's a free register, of course. Recent GCC does sometimes find this zero/set-flags/setcc optimization for generating a 32-bit boolean from a flag-setting instruction, but often misses it when things get complex.
Fortunately for you, your real use-case couldn't do that optimization anyway (except with mov $0, %eax zeroing, which would be off the critical path for latency but cause a partial-register stall on Intel P6 family, and cost more code size.) But it's still a missed optimization for your test case.

All the calculations take place in registers. Why is the stack not storing the result of the register computation here

I am debugging a simple code in c++ and, looking at the disassembly.
In the disassembly, all the calculations are done in the registers. And later, the result of the operation is returned. I only see the a and b variables being pushed onto the stack (the code is below). I don't see the resultant c variable pushed onto the stack. Am I missing something?
I researched on the internet. But on the internet it looks like all variables a,b and c should be pushed onto the stack. But in my Disassembly, I don't see the resultant variable c being pushed onto the stack.
C++ code:
#include<iostream>
using namespace std;
int AddMe(int a, int b)
{
int c;
c = a + b;
return c;
}
int main()
{
AddMe(10, 20);
return 0;
}
Relevant assembly code:
int main()
{
00832020 push ebp
00832021 mov ebp,esp
00832023 sub esp,0C0h
00832029 push ebx
0083202A push esi
0083202B push edi
0083202C lea edi,[ebp-0C0h]
00832032 mov ecx,30h
00832037 mov eax,0CCCCCCCCh
0083203C rep stos dword ptr es:[edi]
0083203E mov ecx,offset _E7BF1688_Function#cpp (0849025h)
00832043 call #__CheckForDebuggerJustMyCode#4 (083145Bh)
AddMe(10, 20);
00832048 push 14h
0083204A push 0Ah
0083204C call std::operator<<<std::char_traits<char> > (08319FBh)
00832051 add esp,8
return 0;
00832054 xor eax,eax
}
As seen above, 14h and 0Ah are pushed onto the stack - corresponding to AddMe(10, 20);
But, when we look at the disassembly for the AddMe function, we see that the variable c (c = a + b), is not pushed onto the stack.
snippet of AddMe in Disassembly:
…
int c;
c = a + b;
00836028 mov eax,dword ptr [a]
0083602B add eax,dword ptr [b]
0083602E mov dword ptr [c],eax
return c;
00836031 mov eax,dword ptr [c]
}
shouldn't c be pushed to the stack in this program? Am I missing something?
All the calculations take place in registers.
Well yes, but they're stored afterwards.
Using memory-destination add instead of just using the accumulator register (EAX) would be an optimization. And one that's impossible when when the result needs to be in a different location than any of the inputs to an expression.
Why is the stack not storing the result of the register computation here
It is, just not with push
You compiled with optimization disabled (debug mode) so every C object really does have its own address in the asm, and is kept in sync between C statements. i.e. no keeping C variables in registers. (Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?). This is one reason why debug mode is extra slow: it's not just avoiding optimizations, it's forcing store/reload.
But the compiler uses mov not push because it's not a function arg. That's a missed optimization that all compilers share, but in this case it's not even trying to optimize. (What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?). It would certainly be possible for the compiler to reserve space for c in the same instruction as storing it, using push. But compilers instead to stack-allocation for all locals on entry to a function with one sub esp, constant.
Somewhere before the mov dword ptr [c],eax that spills c to its stack slot, there's a sub esp, 12 or something that reserves stack space for c. In this exact case, MSVC uses a dummy push to reserve 4 bytes space, as an optimization over sub esp, 4.
In the MSVC asm output, the compiler will emit a c = ebp-4 line or something that defines c as a text substitution for ebp-4. If you looked at disassembly you'd just see [ebp-4] or whatever addressing mode instead of.
In MSVC asm output, don't assume that [c] refers to static storage. It's actually still stack space as expected, but using a symbolic name for the offset.
Putting your code on the Godbolt compiler explorer with 32-bit MSVC 19.22, we get the following asm which only uses symbolic asm constants for the offset, not the whole addressing mode. So [c] might just be that form of listing over-simplifying even further.
_c$ = -4 ; size = 4
_a$ = 8 ; size = 4
_b$ = 12 ; size = 4
int AddMe(int,int) PROC ; AddMe
push ebp
mov ebp, esp ## setup a legacy frame pointer
push ecx # RESERVE 4B OF STACK SPACE FOR c
mov eax, DWORD PTR _a$[ebp]
add eax, DWORD PTR _b$[ebp] # c = a+b
mov DWORD PTR _c$[ebp], eax # spill c to the stack
mov eax, DWORD PTR _c$[ebp] # reload it as the return value
mov esp, ebp # restore ESP
pop ebp # tear down the stack frame
ret 0
int AddMe(int,int) ENDP ; AddMe
The __cdecl calling convention, which AddMe() uses by default (depending on the compiler's configuration), requires parameters to be passed on the stack. But there is nothing requiring local variables to be stored on the stack. The compiler is allowed to use registers as an optimization, as long as the intent of the code is preserved.

gcc inline assembly behave strangely

I am learning GCC's extended inline assembly currently. I wrote an A + B function and wants to detect the ZF flag, but things behave strangely.
The compiler I use is gcc 7.3.1 on x86-64 Arch Linux.
I started from the following code, this code will correctly print the a + b.
int a, b, sum;
scanf("%d%d", &a, &b);
asm volatile (
"movl %1, %0\n"
"addl %2, %0\n"
: "=r"(sum)
: "r"(a), "r"(b)
: "cc"
);
printf("%d\n", sum);
Then I simply added a variable to check flags, it gives me wrong output.
int a, b, sum, zero;
scanf("%d%d", &a, &b);
asm volatile (
"movl %2, %0\n"
"addl %3, %0\n"
: "=r"(sum), "=#ccz"(zero)
: "r"(a), "r"(b)
: "cc"
);
printf("%d %d\n", sum, zero);
The GAS assembly output is
movl -24(%rbp), %eax # %eax = a
movl -20(%rbp), %edx # %edx = b
#APP
# 6 "main.c" 1
movl %eax, %edx
addl %edx, %edx
# 0 "" 2
#NO_APP
sete %al
movzbl %al, %eax
movl %edx, -16(%rbp) # sum = %edx
movl %eax, -12(%rbp) # zero = %eax
This time, the sum will become a + a. But when I just exchanged %2 and %3, the output will be correct a + b.
Then I checked various gcc version (It seems clang does not support it when output is a flag) on wandbox.org, from version 4.5.4 to version 4.7.4 gives the correct result a + b, and starting from version 4.8.1 the outputs are all a + a.
My question is: did I write the wrong code or is there anything wrong with gcc?
The problem is that you clobber %0 before all the inputs (%2 in your case) are consumed:
"movl %1, %0\n"
"addl %2, %0\n"
%0 is being modified by the first MOV before %2 has been consumed. It is possible for an optimizing compiler to re-use a register for an input constraint that was used for an output constraint. In your case one of the compilers chose to use the same register for %2 and %0 which caused the erroneous results.
To get around this problem of changing a register that is being modified before all the inputs are consumed is to mark the output constraint with a &. The & is a modifier denoting Early Clobber:
‘&’
Means (in a particular alternative) that this operand is an earlyclobber operand, which is written before the instruction is finished using the input operands. Therefore, this operand may not lie in a register that is read by the instruction or as part of any memory address.
‘&’ applies only to the alternative in which it is written. In constraints with multiple alternatives, sometimes one alternative requires ‘&’ while others do not. See, for example, the ‘movdf’ insn of the 68000.
A operand which is read by the instruction can be tied to an earlyclobber operand if its only use as an input occurs before the early result is written. Adding alternatives of this form often allows GCC to produce better code when only some of the read operands can be affected by the earlyclobber. See, for example, the ‘mulsi3’ insn of the ARM.
Furthermore, if the earlyclobber operand is also a read/write operand, then that operand is written only after it’s used.
‘&’ does not obviate the need to write ‘=’ or ‘+’. As earlyclobber operands are always written, a read-only earlyclobber operand is ill-formed and will be rejected by the compiler.
The change to your code would be to modify "=r"(sum) to be "=&r"(sum). This will prevent the compiler from using the register used for the output constraint for one of the input constraints.
Word of warning. GCC Inline Assembly is powerful and evil. Very easy to get wrong if you don't know what you are doing. Only use it if you must, avoid it if you can.

Assembler instruction: rdtsc

Could someone help me understand the assembler given in https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
It goes like this:
uint64_t msr;
asm volatile ( "rdtsc\n\t" // Returns the time in EDX:EAX.
"shl $32, %%rdx\n\t" // Shift the upper bits left.
"or %%rdx, %0" // 'Or' in the lower bits.
: "=a" (msr)
:
: "rdx");
How is it different from:
uint64_t msr;
asm volatile ( "rdtsc\n\t"
: "=a" (msr));
Why do we need shift and or operations and what does rdx at the end do?
EDIT: added what is still unclear to the original question.
What does "\n\t" do?
What do ":" do?
delimiters output/input/clobbers...
Is rdx at the end equal to 0?
Just to recap. First line loads the timestamp in registers eax and edx. Second line shifts the value in eax and stores in rdx. Third line ors the value in edx with the value in rdx and saves it in rdx. Fourth line assigns the value in rdx to my variable. The last line sets rdx to 0.
Why are the first three lines without ":"?
They are a template. First line with ":" is output, second is optional input and third one is optional list of clobbers (changed registers).
Is a actually eax and d - edx? Is this hard-coded?
Thanks again! :)
EDIT2: Answered some of my questions...
uint64_t msr;
asm volatile ( "rdtsc\n\t" // Returns the time in EDX:EAX.
"shl $32, %%rdx\n\t" // Shift the upper bits left.
"or %%rdx, %0" // 'Or' in the lower bits.
: "=a" (msr)
:
: "rdx");
Because the rdtsc instruction returns it's results in edx and eax, instead of a straight 64-bit register on a 64-bit machine (See the intel system's programming manual for more information; it's an x86 instruction), the 2nd
instruction shifts the rdx register to the left 32 bits so that edx will be on the upper 32 bits instead of the lower 32 bits.
"=a" (msr) will move the contents of eax into msr (the %0), i.e. into the lower 32 bits of it, so in total you have edx (higher 32 bits) and eax (lower 32 bits) into rdx which is msr.
rdx is a clobber which will represent the msr C variable.
It's similar to doing the following in C:
static inline uint64_t rdtsc(void)
{
uint32_t eax, edx;
asm volatile("rdtsc\n\t", "=a" (eax), "=d" (edx));
return (uint64_t)eax | (uint64_t)edx << 32;
}
And:
uint64_t msr;
asm volatile ( "rdtsc\n\t"
: "=a" (msr));
This one, will just give you the contents of eax into msr.
EDIT:
1) "\n\t" is for the generated assembly to look clearer and error-free, so that you don't end up with things like movl $1, %eaxmovl $2, %ebx
2) Is rdx at the end equal to 0? The left shift does this, it removes the bits that are already in rdx.
3) Is a actually eax and d - edx? Is this hard-coded? Yes, there is a table that describes what characters represents what register, e.g. "D" would be rdi, "c" would be ecx, ...
rdtsc returns timestamp in a pair of 32-bit registers (EDX and EAX). First snippet combines them into single 64-bit register (RDX) which is mapped to msr variable.
Second snippet is the wrong one. I'm not sure about what will happen: either it won't be compiled at all, or only part of msr variable will be updated.

Predict cache line length using time measure

Hi it's my first post here so Hi everyone.
I have problem with predict cache line size in GNU AS. I wrote program in C which calls a function written in assembly.
here is this function
.section .text
.section .data
.global time
time:
pushl %ebp
xor %edx, %edx
xor %eax, %eax
CPUID
RDTSC
popl %ebp
ret
It measure CPU cycles
C code is:
#include <stdio.h>
const int size = 256;
void main(){
unsigned long long cykl, cykl1, cykl2;
unsigned char matrix[size];
char bla;
int i,j,k;
for(i=0 ; i<size; i++)
{
cykl1 = time();
bla = matrix[i];
cykl2 = time();
cykl = cykl2 - cykl1;
printf("i=%d: %lld \n",i, cykl);
}
}
I ran this program, but I can't see any time difference. As I know my cache line lenght is 64 bytes.
Time should rise every time I load next 64bytes of array, am I right?
I will be gratefull for any advice why it can't work properly.
I think there are 3 problems.
First, what your program call might not be your assembly routine, but time(2) system call which returns the current time in seonds.
The assembly routine name should be prefixed with an underscore, i.e., _time, in your *.s file.
You can also declare the routine with __asm__ keyword. See:
http://gcc.gnu.org/onlinedocs/gcc-4.8.0/gcc/Asm-Labels.html
Second, the memory access might be eliminated (or reordered) by GCC optimizer.
You should check the assembly code generated from your C code.
Third, RDTSC is not serialized.
In other words, the CPU might reorder an access to memory and an RDTSC insturction, or they are executed in parallel.
You should insert some instructions to prevent the reordering.
See the description of RDTSC in "Intel® 64 and IA-32 Architecture Software Developer's Manual Volume 2B: Instruction Set Reference, M-Z":
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html

Resources