Replacing #pragma omp atomic with c++ atomics - c++11

I'm replacing some OpenMP code with standard C++11/C++14 atomics and thread support. Here is a minimal OpenMP example:
#include <vector>
#include <cstdint>
void omp_atomic_add(std::vector<std::int64_t> const& rows,
std::vector<std::int64_t> const& cols,
std::vector<double>& values,
std::size_t const row,
std::size_t const col,
double const value)
{
for (auto i = rows[row]; i < rows[row+1]; ++i)
{
if (cols[i] == col)
{
#pragma omp atomic
values[i] += value;
return;
}
}
}
The code updates a matrix stored in CSR format and sits in a hot path of a scientific computation. A std::mutex is technically possible, but the values vector can have millions of elements and is accessed many times more often than that, so a std::mutex is too heavy.
Checking the assembly (https://godbolt.org/g/nPE9Dt), it seems to use CAS (with the disclaimer that my atomic and assembly knowledge is severely limited, so my comments are likely incorrect):
mov rax, qword ptr [rdi]
mov rdi, qword ptr [rax + 8*rcx]
mov rax, qword ptr [rax + 8*rcx + 8]
cmp rdi, rax
jge .LBB0_6
mov rcx, qword ptr [rsi]
.LBB0_2: # =>This Inner Loop Header: Depth=1
cmp qword ptr [rcx + 8*rdi], r8
je .LBB0_3
inc rdi
cmp rdi, rax
jl .LBB0_2
jmp .LBB0_6
#### Interesting stuff happens from here onwards
.LBB0_3:
mov rcx, qword ptr [rdx] # Load values pointer into register
mov rax, qword ptr [rcx + 8*rdi] # Offset to value[i]
.LBB0_4: # =>This Inner Loop Header: Depth=1
movq xmm1, rax # Move value into floating point register
addsd xmm1, xmm0 # Add function arg to the value from the vector<double>
movq rdx, xmm1 # Move result to register
lock # x86 lock
cmpxchg qword ptr [rcx + 8*rdi], rdx # Compare exchange on the value in the vector
jne .LBB0_4 # If failed, go back to the top and try again
.LBB0_6:
ret
Is this possible to do using C++ atomics? The examples I've seen only use std::atomic<double> value{} and nothing in the context of accessing a value through a pointer.

You can create a std::vector<std::atomic<double>> but you cannot change its size.
The first thing I'd do is get gsl::span or write my own variant. Then gsl::span<std::atomic<double>> is a better model for values than std::vector<std::atomic<double>>.
Once we have done that, simply remove the #pragma omp atomic and your code is atomic in C++20, which adds += for std::atomic<double>. In C++17 and earlier you have to implement += manually with a compare-exchange loop:
double old = values[i];
while(!values[i].compare_exchange_weak(old, old+value))
{}
Live example.
Clang 5 generates:
omp_atomic_add(std::vector<long, std::allocator<long> > const&, std::vector<long, std::allocator<long> > const&, std::vector<std::atomic<double>, std::allocator<std::atomic<double> > >&, unsigned long, unsigned long, double):
mov rax, qword ptr [rdi]
mov rdi, qword ptr [rax + 8*rcx]
mov rax, qword ptr [rax + 8*rcx + 8]
cmp rdi, rax
jge .LBB0_6
mov rcx, qword ptr [rsi]
.LBB0_2: # =>This Inner Loop Header: Depth=1
cmp qword ptr [rcx + 8*rdi], r8
je .LBB0_3
inc rdi
cmp rdi, rax
jl .LBB0_2
jmp .LBB0_6
.LBB0_3:
mov rax, qword ptr [rdx]
mov rax, qword ptr [rax + 8*rdi]
.LBB0_4: # =>This Inner Loop Header: Depth=1
mov rcx, qword ptr [rdx]
movq xmm1, rax
addsd xmm1, xmm0
movq rsi, xmm1
lock
cmpxchg qword ptr [rcx + 8*rdi], rsi
jne .LBB0_4
.LBB0_6:
ret
which looks identical at a casual glance.
There is a proposal for atomic_view, which lets you manipulate a non-atomic value through an atomic view; it was standardized in C++20 as std::atomic_ref. In general, C++ only lets you operate atomically on atomic data.
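For reference, here is a minimal sketch of the manual += described above, written so that it also builds as C++11. The helper names atomic_add and csr_atomic_add are made up for illustration, and values is assumed to already be stored as std::vector<std::atomic<double>>:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// C++11-compatible atomic "+=" for a double, built from a compare-exchange loop.
// (Hypothetical helper; C++20 offers fetch_add / += on std::atomic<double> directly.)
inline void atomic_add(std::atomic<double>& target, double value)
{
    double expected = target.load();
    // When the exchange fails, compare_exchange_weak writes the current value
    // back into `expected`, so every retry recomputes the sum from fresh data.
    while (!target.compare_exchange_weak(expected, expected + value)) {
    }
}

// Hypothetical counterpart of omp_atomic_add from the question.
void csr_atomic_add(std::vector<std::int64_t> const& rows,
                    std::vector<std::int64_t> const& cols,
                    std::vector<std::atomic<double>>& values,
                    std::size_t const row,
                    std::size_t const col,
                    double const value)
{
    for (auto i = rows[row]; i < rows[row + 1]; ++i)
    {
        if (cols[static_cast<std::size_t>(i)] == static_cast<std::int64_t>(col))
        {
            atomic_add(values[static_cast<std::size_t>(i)], value);
            return;
        }
    }
}
```

This uses the default sequentially consistent ordering; whether a weaker memory order is sufficient depends on how the rest of the program reads values.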

How to deal with "undefined label XXX" in Go assembly for libc functions like malloc?

I found a project, c2goasm, that can convert assembly from a C compiler into Go assembly, but I'm currently running into a problem.
For example, take "linkedlist.c":
void ListNodeCreat(int val, struct ListNode* ret) {
struct ListNode * node = (struct ListNode *)malloc(sizeof(struct ListNode));
node->val = val;
node->next = NULL;
ret = node;
}
The generated assembly file "linkedlist.s" (GNU assembler, .intel_syntax noprefix) is as follows:
ListNodeCreat: # #ListNodeCreat
push rbp
mov rbp, rsp
and rsp, -16
sub rsp, 32
mov dword ptr [rsp + 28], edi
mov qword ptr [rsp + 16], rsi
mov edi, 16
call malloc
mov qword ptr [rsp + 8], rax
mov ecx, dword ptr [rsp + 28]
mov rax, qword ptr [rsp + 8]
mov dword ptr [rax], ecx
mov rax, qword ptr [rsp + 8]
mov qword ptr [rax + 8], 0
mov rax, qword ptr [rsp + 8]
mov qword ptr [rsp + 16], rax
mov rsp, rbp
pop rbp
ret
Pay attention to the "call malloc" in it. After using c2goasm to get the Go assembly "linkedlist_amd64.s", it still exists:
TEXT ·_ListNodeCreat(SB), $40-16
MOVQ val+0(FP), DI
MOVQ ret+8(FP), SI
ADDQ $8, SP
LONG $0x1c247c89 // mov dword [rsp + 28], edi
LONG $0x24748948; BYTE $0x10 // mov qword [rsp + 16], rsi
LONG $0x000010bf; BYTE $0x00 // mov edi, 16
CALL malloc
LONG $0x24448948; BYTE $0x08 // mov qword [rsp + 8], rax
LONG $0x1c244c8b // mov ecx, dword [rsp + 28]
LONG $0x24448b48; BYTE $0x08 // mov rax, qword [rsp + 8]
WORD $0x0889 // mov dword [rax], ecx
LONG $0x24448b48; BYTE $0x08 // mov rax, qword [rsp + 8]
QUAD $0x000000000840c748 // mov qword [rax + 8], 0
LONG $0x24448b48; BYTE $0x08 // mov rax, qword [rsp + 8]
LONG $0x24448948; BYTE $0x10 // mov qword [rsp + 16], rax
SUBQ $8, SP
RET
So when I run "go build" or "go tool asm linkedlist_amd64.s", I get:
linkedlist_amd64.s:28: undefined label malloc
asm: assembly of linkedlist_amd64.s failed
Does anyone know how to deal with it?

Why does LLVM appear to ignore Rust's assume intrinsic?

LLVM appears to ignore core::intrinsics::assume(..) calls. They do end up in the bytecode, but don't change the resulting machine code. For example take the following (nonsensical) code:
pub fn one(xs: &mut Vec<i32>) {
if let Some(x) = xs.pop() {
xs.push(x);
}
}
This compiles to a whole lot of assembly:
example::one:
push rbp
push r15
push r14
push r12
push rbx
mov rbx, qword ptr [rdi + 16]
test rbx, rbx
je .LBB0_9
mov r14, rdi
lea rsi, [rbx - 1]
mov qword ptr [rdi + 16], rsi
mov rdi, qword ptr [rdi]
mov ebp, dword ptr [rdi + 4*rbx - 4]
cmp rsi, qword ptr [r14 + 8]
jne .LBB0_8
lea rax, [rsi + rsi]
cmp rax, rbx
cmova rbx, rax
mov ecx, 4
xor r15d, r15d
mov rax, rbx
mul rcx
mov r12, rax
setno al
jo .LBB0_11
mov r15b, al
shl r15, 2
test rsi, rsi
je .LBB0_4
shl rsi, 2
mov edx, 4
mov rcx, r12
call qword ptr [rip + __rust_realloc#GOTPCREL]
mov rdi, rax
test rax, rax
je .LBB0_10
.LBB0_7:
mov qword ptr [r14], rdi
mov qword ptr [r14 + 8], rbx
mov rsi, qword ptr [r14 + 16]
.LBB0_8:
or ebp, 1
mov dword ptr [rdi + 4*rsi], ebp
add qword ptr [r14 + 16], 1
.LBB0_9:
pop rbx
pop r12
pop r14
pop r15
pop rbp
ret
.LBB0_4:
mov rdi, r12
mov rsi, r15
call qword ptr [rip + __rust_alloc#GOTPCREL]
mov rdi, rax
test rax, rax
jne .LBB0_7
.LBB0_10:
mov rdi, r12
mov rsi, r15
call qword ptr [rip + alloc::alloc::handle_alloc_error#GOTPCREL]
ud2
.LBB0_11:
call qword ptr [rip + alloc::raw_vec::capacity_overflow#GOTPCREL]
ud2
Now we could introduce the assumption that xs is not full (at capacity) after the pop() (this is nightly only):
#![feature(core_intrinsics)]
pub fn one(xs: &mut Vec<i32>) {
if let Some(x) = xs.pop() {
unsafe {
core::intrinsics::assume(xs.len() < xs.capacity());
}
xs.push(x);
}
}
Yet despite the assume showing up in the LLVM bytecode, the assembly is unchanged. If, however, we use core::hint::unreachable_unchecked() to create a diverging path in the non-assumed case, such as:
pub fn one(xs: &mut Vec<i32>) {
if let Some(x) = xs.pop() {
if xs.len() >= xs.capacity() {
unsafe { core::hint::unreachable_unchecked() }
}
xs.push(x);
}
}
We get the following:
example::one:
mov rax, qword ptr [rdi + 16]
test rax, rax
je .LBB0_2
mov qword ptr [rdi + 16], rax
.LBB0_2:
ret
Which is essentially a no-op, but not too bad. Of course, we could have left the value in place by using:
pub fn one(xs: &mut Vec<i32>) {
xs.last_mut().map(|_e| ());
}
Which compiles down to what we'd expect:
example::one:
ret
Why does LLVM appear to ignore the assume intrinsic?
This now compiles to just a ret on recent versions of rustc, thanks to improvements in both rustc and LLVM. At the time, LLVM appeared to ignore the intrinsic because it was not yet able to use the assumption to optimize this code; newer versions can.

frame pointer register 'ebx' modified by inline assembly code

Unfortunately, I had to re-image my laptop to install Visual Studio 2012. My project builds, but with the above warning. Previously I had Visual Studio 2010 and never got this warning. The code is as follows:
__asm
{
//Initialize pointers on matrices
mov eax, dword ptr [this]
mov ebx, dword ptr [eax+UPkk]
mov dword ptr [UPkk_ptr],ebx
mov ebx, dword ptr [eax+UPk1k]
mov dword ptr [UPk1k_ptr],ebx
mov ebx, dword ptr [eax+DPk1k]
mov dword ptr [DPk1k_ptr],ebx
mov ebx, dword ptr [eax+DPkk]
mov dword ptr [DPkk_ptr],ebx
mov ebx, dword ptr [eax+mat_A]
mov dword ptr [mat_A_ptr],ebx
mov ebx, dword ptr [eax+vec_a]
mov dword ptr [vec_a_ptr],ebx
mov ebx, dword ptr [eax+vec_b]
mov dword ptr [vec_b_ptr],ebx
}
Do I need to change any settings in the project?
Edit: In the above code, when I replace ebx with ecx, the warning goes away and the code works fine. However, there is another piece of code where I use both ebx and ecx, and in that case my program crashes. Here is the code:
__asm
{
//Initialize UPk1k[idx_4] pointer
mov eax, dword ptr [UPk1k_ptr]
mov ebx, dword ptr [idx_4]
imul ebx,8
add eax,ebx
mov dword ptr [UPk1k_id4_ptr],eax
//Initialize UPkk[idx_4] pointer
mov eax, dword ptr [UPkk_ptr]
mov ebx, dword ptr [idx_4]
imul ebx,8
add eax,ebx
mov dword ptr [UPkk_id4_ptr],eax
//Initialize vec_b[idx_1] pointer
mov eax, dword ptr [vec_b_ptr]
mov ebx, dword ptr [idx_1]
imul ebx,8
add eax,ebx
mov dword ptr [vec_b_id1_ptr],eax
mov edi, dword ptr [idx_1] //Load idx_1 in edi
mov esi, 0 //initialize loop counter
jmp start_proc11
start_for11:inc esi //idx_2++
start_proc11:cmp esi, edi //idx_2<idx_1 ?
jge end_for11 //If not, end of the loop
mov eax, UPk1k_id4_ptr //load UPk1k[idx_4] address
mov ebx, vec_b_ptr //load vec_b address
mov ecx, esi
imul ecx,8
add eax, ecx //UPk1k[idx_4+idx_2] in eax
add ebx, ecx //vec_b[idx_2] in ebx
fld qword ptr [eax]//push UPk1k[idx_4+idx_2]
fld qword ptr [ebx] //push vec_b[idx_2]
mov edx,dword ptr [Sd_ptr]
fmul qword ptr [edx] //vec_b[idx_2]*Sd
fadd //pop UPk1k[idx_4+idx_2]+vec_b[idx_2]*Sd
mov edx,dword ptr [UPkk_id4_ptr]
fstp qword ptr [edx+esi*8] //pop UPkk[idx_4+idx_2]=UPk1k[idx_4+idx_2]+vec_b[idx_2]*Sd
fld qword ptr [ebx] //push vec_b[idx_2]
mov edx,dword ptr [vec_b_id1_ptr]
fld qword ptr [edx] //push vec_b[idx_1]
fmul qword ptr [eax]
fadd
fstp qword ptr [ebx]
jmp start_for11 //end of the loop
end_for11:
}
See the MSDN documentation on registers and on that warning. It explains why the warning is produced: modifying EBX forces the compiler to preserve its value, which might be counter-productive to performance, the usual reason inline asm is used. Relevant quote:
In addition, by using EBX, ESI or EDI in inline assembly code, you force the compiler to save and restore those registers in the function prologue and epilogue.
To disable the warning, I think the syntax is
#pragma warning( disable : 4731 )
However, I'd try to use some other register instead, because the warning is there for a good reason, like most warnings.
In fact, looking at your asm code, simply replacing ebx with ecx should solve the problem.
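If the inline assembly is not strictly required, another way to sidestep register clobbering is to write the loop in plain C++ and let the compiler allocate the registers. Below is a rough sketch of what the second __asm block appears to compute, reconstructed from its instructions and comments. The function name and signature are hypothetical, and Sd is passed by value rather than through Sd_ptr. Rounding may not match the x87 version bit for bit.

```cpp
#include <cstddef>

// Plain C++ rendering of the loop in the second __asm block (hypothetical
// signature; array names follow the question's comments).
void update_row(double* UPkk, const double* UPk1k, double* vec_b,
                double Sd, std::size_t idx_1, std::size_t idx_4)
{
    for (std::size_t idx_2 = 0; idx_2 < idx_1; ++idx_2)
    {
        // UPkk[idx_4+idx_2] = UPk1k[idx_4+idx_2] + vec_b[idx_2]*Sd
        UPkk[idx_4 + idx_2] = UPk1k[idx_4 + idx_2] + vec_b[idx_2] * Sd;

        // vec_b[idx_2] = vec_b[idx_2] + vec_b[idx_1]*UPk1k[idx_4+idx_2]
        vec_b[idx_2] = vec_b[idx_2] + vec_b[idx_1] * UPk1k[idx_4 + idx_2];
    }
}
```

The statements are kept in the same order as the assembly so that loads and stores happen in the same sequence.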

CGShadingGetBounds() signature?

I am trying to work out the signature of the function CGShadingGetBounds().
I tried CG_EXTERN CGRect CGShadingGetBounds(CGShadingRef); but that does not seem to be the case.
Can someone help figure out the signature?
Below is the disassembly.
__text:000000000016BB76 public _CGShadingGetBounds
__text:000000000016BB76 _CGShadingGetBounds proc near ; CODE XREF: _log_LogShading+1B8p
__text:000000000016BB76 ; _dlr_DrawShading+1FEp ...
__text:000000000016BB76 push rbp
__text:000000000016BB77 mov rbp, rsp
__text:000000000016BB7A mov rax, rdi
__text:000000000016BB7D cmp byte ptr [rsi+28h], 0
__text:000000000016BB81 jz short loc_16BBAC
__text:000000000016BB83 movsd xmm0, qword ptr [rsi+30h]
__text:000000000016BB88 movsd qword ptr [rdi], xmm0
__text:000000000016BB8C movsd xmm0, qword ptr [rsi+38h]
__text:000000000016BB91 movsd qword ptr [rdi+8], xmm0
__text:000000000016BB96 movsd xmm0, qword ptr [rsi+40h]
__text:000000000016BB9B movsd qword ptr [rdi+10h], xmm0
__text:000000000016BBA0 movsd xmm0, qword ptr [rsi+48h]
__text:000000000016BBA5
__text:000000000016BBA5 loc_16BBA5: ; CODE XREF: _CGShadingGetBounds+5Ej
__text:000000000016BBA5 movsd qword ptr [rdi+18h], xmm0
__text:000000000016BBAA pop rbp
__text:000000000016BBAB retn
__text:000000000016BBAC ; ---------------------------------------------------------------------------
__text:000000000016BBAC
__text:000000000016BBAC loc_16BBAC: ; CODE XREF: _CGShadingGetBounds+Bj
__text:000000000016BBAC lea rcx, _CGRectInfinite
__text:000000000016BBB3 movsd xmm0, qword ptr [rcx]
__text:000000000016BBB7 movsd xmm1, qword ptr [rcx+8]
__text:000000000016BBBC movsd qword ptr [rdi], xmm0
__text:000000000016BBC0 movsd qword ptr [rdi+8], xmm1
__text:000000000016BBC5 movsd xmm0, qword ptr [rcx+10h]
__text:000000000016BBCA movsd qword ptr [rdi+10h], xmm0
__text:000000000016BBCF movsd xmm0, qword ptr [rcx+18h]
__text:000000000016BBD4 jmp short loc_16BBA5
__text:000000000016BBD4 _CGShadingGetBounds endp
My aim is to identify the bounds in which shading is going to happen.
I believe the signature you mentioned
CG_EXTERN CGRect CGShadingGetBounds(CGShadingRef);
is correct. For example, if you try to reconstruct such a function with a custom object, like this:
typedef struct
{
long a1, a2, a3, a4, a5;
char b6;
CGRect r;
} MyObj;
CGRect ReconstructFunc(MyObj *o)
{
if (o->b6) return o->r;
return CGRectNull;
}
Of course, this does something different, but the "quick" path (where b6 is non-zero) is very similar to the original function, in both assembly and behaviour:
pushq %rbp
movq %rsp, %rbp
movq %rdi, %rax
cmpb $0, 40(%rsi)
je LBB0_2
movq 72(%rsi), %rcx
movq %rcx, 24(%rax)
movq 64(%rsi), %rcx
movq %rcx, 16(%rax)
movq 48(%rsi), %rcx
movq 56(%rsi), %rdx
movq %rdx, 8(%rax)
movq %rcx, (%rax)
popq %rbp
ret
... (continues)
This is basically the same as the assembly you posted. It also reveals the convention that Objective-C and GCC on the Mac use when compiling functions that return CGRect structs. According to the x64 ABI, parameters are passed in these registers: RDI, RSI, RDX (and more). If you look at the first two, RDI and RSI, they clearly contain arguments: the first is a pointer to the output struct (CGRect), and the second is the opaque struct (CGShadingRef).
Thus I believe that GCC on Mac translates this:
CGRect myrect = MyFuncReturningRect(param);
into this:
CGRect myrect;
MyFuncReturningRect(&myrect, param);
Anyway, to sum it all up, I strongly believe your guessed signature is correct. If the function doesn't return the values you expect, some other factor is responsible (probably the byte ptr [rsi+28h] value, which must be non-zero to get non-dummy information).
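To make the hidden-pointer convention concrete, here is a small sketch with made-up stand-in types (Rect and Shading are hypothetical, not the real CoreGraphics definitions). Under the System V x86-64 calling convention, a struct of four doubles is too large to come back in registers, so the by-value function is compiled roughly like the second, explicit form:

```cpp
// Hypothetical stand-ins for CGRect and the opaque shading object.
struct Rect { double x, y, width, height; };   // 32 bytes, like CGRect on x86-64
struct Shading { bool has_bounds; Rect bounds; };

// What the source declares: the struct is returned by value.
Rect get_bounds(const Shading* s)
{
    if (s->has_bounds) return s->bounds;
    return Rect{0.0, 0.0, 0.0, 0.0};
}

// Roughly what the compiled code does: the caller passes the address of the
// result as a hidden first argument (RDI), the real argument moves to RSI,
// and the same address is handed back (in RAX).
Rect* get_bounds_lowered(Rect* out, const Shading* s)
{
    if (s->has_bounds) *out = s->bounds;
    else               *out = Rect{0.0, 0.0, 0.0, 0.0};
    return out;
}
```

This matches the mov rax, rdi at the top of the disassembly and the reads through rsi that follow it.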

Find most significant DWORD in a DWORD array

I want to find the most significant DWORD which isn't equal to 0 in a DWORD array. The algorithm should be optimized for data sizes up to 128 bytes.
I've made three different functions, which all return the index of that DWORD.
unsigned long msb_msvc(long* dw, std::intptr_t n)
{
while( --n )
{
if( dw[n] )
break;
}
return n;
}
static inline unsigned long msb_386(long* dw, std::intptr_t n)
{
__asm
{
mov ecx, [dw]
mov eax, [n]
__loop: sub eax, 1
jz SHORT __exit
cmp DWORD PTR [ecx + eax * 4], 0
jz SHORT __loop
__exit:
}
}
static inline unsigned long msb_sse2(long* dw, std::intptr_t n)
{
__asm
{
mov ecx, [dw]
mov eax, [n]
test ecx, 0x0f
jnz SHORT __128_unaligned
__128_aligned:
cmp eax, 4
jb SHORT __64
sub eax, 4
movdqa xmm0, XMMWORD PTR [ecx + eax * 4]
pxor xmm1, xmm1
pcmpeqd xmm0, xmm1
pmovmskb edx, xmm0
not edx
and edx, 0xffff
jz SHORT __128_aligned
jmp SHORT __exit
__128_unaligned:
cmp eax, 4
jb SHORT __64
sub eax, 4
movdqu xmm0, XMMWORD PTR [ecx + eax * 4]
pxor xmm1, xmm1
pcmpeqd xmm0, xmm1
pmovmskb edx, xmm0
not edx
and edx, 0xffff
jz SHORT __128_unaligned
jmp SHORT __exit
__64:
cmp eax, 2
jb __32
sub eax, 2
movq mm0, MMWORD PTR [ecx + eax * 4]
pxor mm1, mm1
pcmpeqd mm0, mm1
pmovmskb edx, mm0
not edx
and edx, 0xff
emms
jz SHORT __64
jmp SHORT __exit
__32:
test eax, eax
jz SHORT __exit
xor eax, eax
jmp __leave ; retn
__exit:
bsr edx, edx
shr edx, 2
add eax, edx
__leave:
}
}
These functions will be used to preselect data that will then be compared against each other, so they need to be fast.
Does anybody know a better algorithm?
I think you are just looking for the first non-zero word in a given array (scanning from the end). I would definitely go with a simple loop written in C. If there's some reason why this is super performance critical, I would recommend you look at the larger context of your program and ask, for example, why you need to find the non-zero element in the array at all and why you can't already know its location.
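A simple loop along those lines might look like the sketch below. The name msb_index and the use of std::uint32_t and std::size_t are my own choices; it keeps the convention of msb_msvc from the question by returning 0 when nothing above index 0 is non-zero, and additionally tolerates n == 0.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the index of the highest-indexed non-zero element of dw[0..n),
// or 0 when no element above index 0 is non-zero (or when n == 0).
std::size_t msb_index(const std::uint32_t* dw, std::size_t n)
{
    while (n > 1)
    {
        --n;
        if (dw[n] != 0)
            return n;
    }
    return 0;
}
```

With at most 32 DWORDs (128 bytes) to scan, a loop like this gives the compiler a fair chance to unroll or vectorize it on its own, so it is worth measuring before reaching for hand-written SSE2.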
