update integer array elements atomically C++ - c++11

Given a shared array of integer counters, can a thread atomically fetch-and-add a single array element without locking the entire array?
Here's an illustration of a working model that uses a mutex to lock access to the entire array.
// thread-shared class members
std::mutex count_array_mutex_;
std::vector<int> counter_array_( 100ish );
// Thread critical section
int counter_index = ... // unpredictable index
int current_count;
{
    std::lock_guard<std::mutex> lock(count_array_mutex_);
    current_count = counter_array_[counter_index] ++;
}
// ... do stuff using current_count.
I'd like multiple threads to be able to fetch-add separate array elements simultaneously.
So far, in my research of std::atomic<int>, I'm thrown off by the fact that constructing the atomic object also constructs the protected member. (And there are plenty of answers explaining why you can't make a std::vector<std::atomic<int>>.)

C++20 adds std::atomic_ref<T>, which lets you do atomic operations on an object that wasn't atomic<T> to start with.
At the time of writing it was not yet shipped in the standard library of most compilers, but there is a working implementation for gcc/clang/ICC and other compilers with GNU extensions.
Previously, atomic access to "plain" data was only available with some platform-specific functions like Microsoft's LONG InterlockedExchange(LONG volatile *Target, LONG Value); or GNU C / C++ type __atomic_add_fetch (type *ptr, type val, int memorder) (the same builtins that C++ libraries for GNU compilers use to implement std::atomic<T>).
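As a minimal sketch of the GNU-builtin route, assuming a GCC- or Clang-compatible compiler: the helper name fetch_increment is made up for illustration, and it uses __atomic_fetch_add (the sibling of __atomic_add_fetch that returns the old value, which is what the question's fetch-add needs).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Atomically adds 1 to v[idx] and returns the value it held before.
// __atomic_fetch_add is a GCC/Clang builtin, not standard C++.
inline int fetch_increment(std::vector<int>& v, std::size_t idx)
{
    // Only the element access is atomic; the vector itself must not be
    // resized while other threads are using it.
    assert(idx < v.size());
    return __atomic_fetch_add(&v[idx], 1, __ATOMIC_RELAXED);
}
```

This is not portable to compilers without the GNU builtins; MSVC would need the Interlocked family instead.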
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0019r8.html includes some intro stuff about the motivation. CPUs can easily do this, compilers can already do this, and it's been annoying that C++ didn't expose this capability portably.
So instead of having to wrestle with C++ to get all the non-atomic allocation and init done in a constructor, you can just have every access create an atomic_ref to the element you want to access. (It's free to instantiate as a local, at least when it's lock-free, on any "normal" C++ implementation.)
This will even let you do things like resize the std::vector<int> after you've ensured no other threads are accessing the vector elements or the vector control block itself. And then you can signal the other threads to resume.
At the time of writing it was not yet implemented in libstdc++ or libc++ for gcc/clang, so this example pulls in the sample implementation from the proposal:
#include <vector>
#include <atomic>
#define Foo std // this atomic_ref.hpp puts it in namespace Foo, not std.
// current raw url for https://github.com/ORNL/cpp-proposals-pub/blob/master/P0019/atomic_ref.hpp
#include "https://raw.githubusercontent.com/ORNL/cpp-proposals-pub/580934e3b8cf886e09accedbb25e8be2d83304ae/P0019/atomic_ref.hpp"
void inc_element(std::vector<int> &v, size_t idx)
{
    v[idx]++;
}
void atomic_inc_element(std::vector<int> &v, size_t idx)
{
    std::atomic_ref<int> elem(v[idx]);
    static_assert(decltype(elem)::is_always_lock_free,
                  "performance is going to suck without lock-free atomic_ref<T>");
    elem.fetch_add(1, std::memory_order_relaxed); // take your pick of memory order here
}
For x86-64, these compile exactly the way we'd hope with GCC, using the sample implementation (for compilers implementing GNU extensions) linked in the C++ working-group proposal: https://github.com/ORNL/cpp-proposals-pub/blob/master/P0019/atomic_ref.hpp
From the Godbolt compiler explorer with g++8.2 -Wall -O3 -std=gnu++2a:
inc_element(std::vector<int, std::allocator<int> >&, unsigned long):
    mov rax, QWORD PTR [rdi]              # load the pointer member of std::vector
    add DWORD PTR [rax+rsi*4], 1          # and index it as a memory destination
    ret
atomic_inc_element(std::vector<int, std::allocator<int> >&, unsigned long):
    mov rax, QWORD PTR [rdi]
    lock add DWORD PTR [rax+rsi*4], 1     # same but atomic RMW
    ret
The atomic version is identical except it uses a lock prefix to make the read-modify-write atomic, by making sure no other core can read or write the cache line while this core is in the middle of atomically modifying it. Just in case you were curious how atomics work in asm.
Many non-x86 ISAs instead need an LL/SC retry loop to implement an atomic RMW, even with relaxed memory order (AArch64 before the ARMv8.1 LSE atomic instructions, for example).
The point here is that constructing / destructing the atomic_ref doesn't cost anything. Its member pointer fully optimizes away. So this is exactly as cheap as a vector<atomic<int>>, but without the headache.
That holds as long as you're careful not to create data-race UB by resizing the vector, or by accessing an element without going through atomic_ref. (It would potentially manifest as a use-after-free on many real implementations if std::vector reallocated the memory in parallel with another thread indexing into it, and of course you'd be atomically modifying a stale copy.)
This definitely gives you rope to hang yourself if you don't carefully respect the fact that the std::vector object itself is not atomic, and also that the compiler won't stop you from doing non-atomic access to the underlying v[idx] after other threads have started using it.
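To make the usage pattern concrete, here is a small sketch in which several threads increment elements of a plain std::vector<int> that is sized before any thread starts. The driver hammer_counters is a hypothetical name. The code uses std::atomic_ref when the library provides it and otherwise falls back to the GNU builtin, which assumes a GCC- or Clang-compatible compiler.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Atomic increment of one element of a plain vector<int>.
inline void atomic_inc_element(std::vector<int>& v, std::size_t idx)
{
#if defined(__cpp_lib_atomic_ref)
    std::atomic_ref<int> elem(v[idx]);
    elem.fetch_add(1, std::memory_order_relaxed);
#else
    // Pre-C++20 fallback (assumption: GCC/Clang-compatible compiler).
    __atomic_fetch_add(&v[idx], 1, __ATOMIC_RELAXED);
#endif
}

// Hypothetical driver: n_threads threads each bump every counter iters times.
inline std::vector<int> hammer_counters(std::size_t n_counters,
                                        unsigned n_threads, int iters)
{
    std::vector<int> counters(n_counters, 0);  // never resized after this point
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&counters, iters] {
            for (int i = 0; i < iters; ++i)
                for (std::size_t idx = 0; idx < counters.size(); ++idx)
                    atomic_inc_element(counters, idx);
        });
    }
    // join() orders all increments before the plain reads done by the caller.
    for (auto& w : workers)
        w.join();
    return counters;
}
```

Every access made while the workers run goes through the atomic path; the non-atomic reads happen only after the threads have been joined.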

One way:
// Create.
std::vector<std::atomic<int>> v(100);
// Initialize.
for(auto& e : v)
    e.store(0, std::memory_order_relaxed);
// Atomically increment.
auto unpredictable_index = std::rand() % v.size();
int old = v[unpredictable_index].fetch_add(1, std::memory_order_relaxed);
Note that the std::atomic<> copy constructor is deleted, so the vector cannot be resized and must be constructed with its final number of elements.
Since the resize functionality of std::vector is lost anyway, you may as well use std::unique_ptr<std::atomic<int>[]> instead of std::vector, e.g.:
// Create.
unsigned const N = 100;
std::unique_ptr<std::atomic<int>[]> p(new std::atomic<int>[N]);
// Initialize.
for(unsigned i = 0; i < N; ++i)
    p[i].store(0, std::memory_order_relaxed);
// Atomically increment.
auto unpredictable_index = std::rand() % N;
int old = p[unpredictable_index].fetch_add(1, std::memory_order_relaxed);
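If the fixed-size array is used in several places, it may be convenient to wrap it. The following is only a sketch of one possible wrapper; the class name CounterTable and its member names are invented for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical fixed-size table of atomic counters.
class CounterTable {
public:
    explicit CounterTable(std::size_t n)
        : size_(n), slots_(new std::atomic<int>[n])
    {
        // Before C++20 a default-constructed std::atomic<int> holds an
        // indeterminate value, so store an explicit starting value.
        for (std::size_t i = 0; i < size_; ++i)
            slots_[i].store(0, std::memory_order_relaxed);
    }

    // Returns the count held before the increment.
    int fetch_increment(std::size_t idx)
    {
        assert(idx < size_);
        return slots_[idx].fetch_add(1, std::memory_order_relaxed);
    }

    int load(std::size_t idx) const
    {
        assert(idx < size_);
        return slots_[idx].load(std::memory_order_relaxed);
    }

    std::size_t size() const { return size_; }

private:
    std::size_t size_;
    std::unique_ptr<std::atomic<int>[]> slots_;
};
```

The table is neither copyable nor resizable, which matches the constraint described above.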

Related

How to call a c-function that takes a c-struct that contains pointers

From a Go program on a Raspberry Pi I'm trying to call a C function (a MATLAB function converted to C). The input to the function is a pointer to a struct, and the struct contains a pointer to a double (data), a pointer to an int (size), and two ints (allocatedSize, numDimensions). I have tried several ways but nothing has worked: when I get past compilation, running the program usually fails with panic: runtime error: cgo argument has Go pointer to Go pointer.
sumArray.c
/*sumArray.C*/
/* Include files */
#include "sumArray.h"
/* Function Definitions */
double sumArray(const emxArray_real_T *A1)
{
    double S1;
    int vlen;
    int k;
    vlen = A1->size[0];
    if (A1->size[0] == 0) {
        S1 = 0.0;
    } else {
        S1 = A1->data[0];
        for (k = 2; k <= vlen; k++) {
            S1 += A1->data[k - 1];
        }
    }
    return S1;
}
sumArray.h
#ifndef SUMARRAY_H
#define SUMARRAY_H
/* Include files */
#include <stddef.h>
#include <stdlib.h>
#include "sumArray_types.h"
/* Function Declarations */
extern double sumArray(const emxArray_real_T *A1);
#endif
sumArray_types.h
#ifndef SUMARRAY_TYPES_H
#define SUMARRAY_TYPES_H
/* Include files */
/* Type Definitions */
#ifndef struct_emxArray_real_T
#define struct_emxArray_real_T
struct emxArray_real_T
{
    double *data;
    int *size;
    int allocatedSize;
    int numDimensions;
};
#endif /*struct_emxArray_real_T*/
#ifndef typedef_emxArray_real_T
#define typedef_emxArray_real_T
typedef struct emxArray_real_T emxArray_real_T;
#endif /*typedef_emxArray_real_T*/
#endif
/* End of code generation (sumArray_types.h) */
main.go
// #cgo CFLAGS: -g -Wall
// #include <stdlib.h>
// #include "sumArray.h"
import "C"
import (
    "fmt"
)
func main() {
    a1 := [4]C.double{1, 1, 1, 1}
    a2 := [1]C.int{4}
    cstruct := C.emxArray_real_T{data: &a1[0], size: &a2[0]}
    cstructArr := [1]C.emxArray_real_T{cstruct}
    y := C.sumArray(&cstructArr[0])
    fmt.Print(float64(y))
}
With this example I get panic: runtime error: cgo argument has Go pointer to Go pointer when I run the program.
I do not know how to make it work, or whether it is possible at all. I hope someone can help me or give some direction on how to solve this.
Too much for a comment, so here's the answer.
First, the original text:
A direct solution is to use C.malloc(4 * C.sizeof_double) to allocate the array of doubles. Note that you have to make sure to call C.free() on it when done. The same applies to the second array of a single int.
Now, your comment to the Mattanis' remark, which was, reformatted a bit:
thanks for giving some pointers. I tried with
a1 := [4]C.double{1,1,1,1}
sizeA1 := C.malloc(4 * C.sizeof_double)
cstruct := C.emxArray_real_T{
    data: &a1[0],
    size: (*C.int)(sizeA1)
}
y := C.sumArray(cstruct)
defer C.free(sizeA1)
but it gave me the same answer as before, cgo argument has Go pointer to Go pointer, when I tried to run the program
You still seem to miss the crucial point. When you're using cgo, there are two disjoint "memory views":
"The Go memory" is everything allocated by the Go runtime powering your running process—on behalf of that process. This memory (most of the time, barring weird tricks) is known to the GC—which is a part of the runtime.
"The C memory" is memory allocated by the C code—typically by calling the libc's malloc()/realloc().
Now imagine a not-so-far-fetched scenario:
Your program runs, the C "side" gets initialized and spawns its own thread (or threads), and holds handles on them.
Your Go "side" already uses multiple threads to run your goroutines.
You allocate some Go memory in your Go code and pass it to the C side.
The C side passes the address of that memory to one or more of its own threads.
Your program continues to chug away, and so do the C-side threads, in parallel with your Go code.
As a result you have a classic setup for unsynchronized parallel memory access, which is a sure recipe for disaster on today's multi-core, multi-socket hardware.
Also consider that Go is a considerably higher-level programming language than C; at the bare minimum, it has automatic garbage collection, and notice that nothing in the Go spec specifies how exactly the GC must be implemented.
This means, a particular implementation of Go (including the reference one—in the future) is free to allow its GC to move arbitrary objects in the memory¹, and this means updating every pointer pointing into the memory block in its original location to point to the same place in the block's new location—after it was moved.
With these considerations in mind, the Go devs postulated that in order to keep cgo-using programs future-proof², it is forbidden to pass to C any memory blocks which contain pointers to other Go memory blocks.
It's okay to pass Go memory blocks which contain pointers to C memory, though.
Going back to the example from your second comment,
you still allocate the array of 4 doubles, a1, in the Go memory.
Then the statement cstruct := C.emxArray_real_T{...} again allocates an instance of C.emxArray_real_T in the Go memory, and so after you initialize its data field with a pointer to Go memory (&a1[0]), and then pass its address to the C side, the runtime performs its dynamic checks before actually calling into the C side and crashes your program.
¹ This is typical behaviour for the so-called "generational" garbage collectors, for one example.
² That is, you recompile your program with a future version of the Go compiler of the same "major" release, and the program continues to work, unmodified.
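One way to stay within that rule is to let the C side own every block the struct points to. The sketch below is written in the C-compatible subset so it could live next to sumArray.c; the helper names emx_create_1d and emx_destroy are hypothetical and not part of the generated MATLAB code. The struct definition is repeated only to keep the sketch self-contained.

```cpp
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Same layout as the generated sumArray_types.h in the question. */
struct emxArray_real_T {
    double *data;
    int *size;
    int allocatedSize;
    int numDimensions;
};
typedef struct emxArray_real_T emxArray_real_T;

/* Hypothetical helper: builds a 1-D array entirely in malloc'ed (C) memory
   and copies n values into it. Returns NULL if an allocation fails. */
emxArray_real_T *emx_create_1d(const double *values, int n)
{
    assert(n >= 0);
    size_t count = n > 0 ? (size_t)n : 1;   /* avoid a zero-byte malloc */
    emxArray_real_T *a = (emxArray_real_T *)malloc(sizeof *a);
    if (a == NULL)
        return NULL;
    a->data = (double *)malloc(sizeof(double) * count);
    a->size = (int *)malloc(sizeof(int));
    if (a->data == NULL || a->size == NULL) {
        free(a->data);
        free(a->size);
        free(a);
        return NULL;
    }
    if (n > 0)
        memcpy(a->data, values, sizeof(double) * (size_t)n);
    a->size[0] = n;
    a->allocatedSize = n;
    a->numDimensions = 1;
    return a;
}

/* Releases everything emx_create_1d allocated. */
void emx_destroy(emxArray_real_T *a)
{
    if (a == NULL)
        return;
    free(a->data);
    free(a->size);
    free(a);
}
```

With helpers like these, the Go side would only pass a pointer to a flat array of doubles, which contains no Go pointers, and would receive back a struct that lives entirely in C memory.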

How to instruct avr-gcc to optimize volatile variables?

Code for an interrupt service handler:
volatile unsigned char x = 0;
void interruptHandler() __attribute__ ((signal));
void interruptHandler() {
    f();
    g();
}
Calls:
void f() { x ++; } // could be more complex, could also be in a different file
void g() { x ++; } // as `f()`, this is just a very simple example
Because x is a volatile variable, it is read and written every time it is used. The body of the interrupt handler compiles to (avr-gcc -g -c -Wa,-alh -mmcu=atmega328p -Ofast file.c):
lds r24,x
subi r24,lo8(-(1))
sts x,r24
lds r24,x
subi r24,lo8(-(1))
sts x,r24
Now I can manually inline the functions and employ a temporary variable:
unsigned char y = x;
y ++;
y ++;
x = y;
Or I can just write:
x += 2;
Both examples compile to the much more efficient:
lds r24,x
subi r24,lo8(-(2))
sts x,r24
Is it possible to tell avr-gcc to optimize access to volatile variables inside of interruptHandler, i.e. to do my manual optimization automatically?
After all, while interruptHandler is running, global interrupts are disabled, and it is impossible for x to change. I prefer not having to hand optimize code, thereby possibly creating duplicate code (if f() and g() are needed elsewhere) and introducing errors.
Is it possible to tell avr-gcc to optimize access to volatile variables inside of interruptHandler, i.e. to do my manual optimization automatically?
No, that is not possible in the C language.
After all, while interruptHandler is running, global interrupts are disabled
The compiler does not know this - and you could simply put an sei into the handler to turn them back on.
Also note that hardware registers are declared volatile, too. Some of these - like the UART data register - have side effects even when read. The compiler must not remove any reads or writes for these.
If you declare a variable to be volatile, then all accesses to it are volatile - the compiler will read and write it exactly as many times as the source code says, without combining them or doing similar optimisations.
So if you want combining optimisations, declare the variable without the "volatile" - then you will get what you need inside the interrupt code.
And then from outside the interrupt code, you can force volatile accesses using something like this macro:
#define volatileAccess(v) (*((volatile typeof(v) *) &(v)))
Use "volatileAccess(x)" rather than "x" outside the interrupt code.
Just don't forget that "volatile" does not mean "atomic"!
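The same idea can be sketched in C++ with a small function template instead of a typeof macro. All names here (isr_counter, volatile_access, interrupt_handler_body, read_from_main) are illustrative, and the sketch is plain host code rather than a real AVR interrupt handler.

```cpp
// Shared with a (hypothetical) interrupt handler; deliberately NOT volatile.
unsigned char isr_counter = 0;

// Returns a volatile view of v, so each access through it is a real
// memory access that the compiler may not merge or drop.
template <typename T>
inline volatile T& volatile_access(T& v)
{
    return static_cast<volatile T&>(v);
}

// "Inside" the handler: plain accesses, so the compiler is free to
// combine the two increments into a single add of 2.
inline void f() { isr_counter++; }
inline void g() { isr_counter++; }
void interrupt_handler_body()
{
    f();
    g();
}

// "Outside" the handler: force a fresh read from memory every time.
unsigned char read_from_main()
{
    return volatile_access(isr_counter);
}
```

This only controls how the compiler treats the accesses. For variables wider than one byte on AVR, the main code would still need to disable interrupts around the access to read a consistent value.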

C++ stateful allocator de-allocate issues

This issue stems from my misunderstanding of how the standard library uses my custom allocator. I have a stateful allocator that keeps a vector of allocated blocks. This vector is pushed into when allocating and searched through during de-allocation.
From my debugging it appears that different instances of my object (the this pointers differ) are being called on de-allocation. For example, MyAllocator (this = 1) is called to allocate 20 bytes, then some time later MyAllocator (this = 2) is called to de-allocate the 20 bytes allocated earlier. Obviously the vector in MyAllocator (this = 2) doesn't contain the 20-byte block allocated by the other allocator, so it fails to de-allocate. My understanding was that C++11 allows stateful allocators. What's going on, and how do I fix this?
I already have my operator == set to only return true when this == &rhs
pseudo-code:
template<typename T>
class MyAllocator
{
public:
    T* allocate(std::size_t n)
    {
        ... // make a block of size sizeof(T) * n
        blocks.push_back(block);
        return (T*)block.start;
    }
    void deallocate(T* start, std::size_t n)
    {
        /* This fails because the block array is not the
           same and so doesn't find the block it wants */
        blocks.erase(std::remove_if(blocks.begin(), blocks.end(), [&](const MemoryBlocks& block)
        {
            return block.start >= (uint64_t)start && block.end <= ((uint64_t)start + sizeof(T)*n);
        }), blocks.end());
    }
    bool operator==(const MyAllocator& rhs) const
    {
        // my attempt to make sure internal states are same
        return this == &rhs;
    }
private:
    std::vector<MemoryBlocks> blocks;
};
I'm using this allocator for a std::vector, on gcc, so as far as I know no weird rebind stuff is going on.
As @Igor mentioned, allocators must be copyable. Importantly, though, copies must share their state, even AFTER they have been copied from. In this case the fix was easy: I made the blocks vector a shared_ptr as suggested, so after a copy all updates go to the same vector, since every copy points to the same thing.
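A minimal sketch of that fix, assuming a simple address-to-size table in place of the original MemoryBlocks type (which the question does not show). TrackingAllocator and live_blocks are hypothetical names, and the sketch is not thread-safe.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <new>
#include <vector>

// Hypothetical allocator whose copies all share one block table.
template <typename T>
class TrackingAllocator {
public:
    using value_type = T;
    using BlockTable = std::map<void*, std::size_t>;  // start address -> bytes

    TrackingAllocator() : blocks_(std::make_shared<BlockTable>()) {}

    // Converting copy: containers may rebind the allocator to another type.
    template <typename U>
    TrackingAllocator(const TrackingAllocator<U>& other) noexcept
        : blocks_(other.blocks_) {}

    T* allocate(std::size_t n)
    {
        void* p = ::operator new(n * sizeof(T));
        (*blocks_)[p] = n * sizeof(T);
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept
    {
        // Any copy can free the block, because the table is shared.
        blocks_->erase(static_cast<void*>(p));
        ::operator delete(p);
    }

    std::size_t live_blocks() const { return blocks_->size(); }

    // Two allocators are equal when they share the same table.
    template <typename U>
    bool operator==(const TrackingAllocator<U>& rhs) const noexcept
    {
        return blocks_ == rhs.blocks_;
    }
    template <typename U>
    bool operator!=(const TrackingAllocator<U>& rhs) const noexcept
    {
        return !(*this == rhs);
    }

private:
    template <typename U> friend class TrackingAllocator;
    std::shared_ptr<BlockTable> blocks_;
};
```

Comparing the shared pointers rather than this keeps operator== consistent with the allocator requirements: copies compare equal and can release each other's memory.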

atomic_inc and atomic_xchg in gcc assembly

I have written the following user-level code snippet to test two sub functions, atomic inc and xchg (refer to Linux code).
All I need is to perform operations on a 32-bit integer, which is why I explicitly use int32_t.
I assume global_counter will be contended by different threads, while tmp_counter is fine.
#include <stdio.h>
#include <stdint.h>
int32_t global_counter = 10;
/* Increment the value pointed by ptr */
void atomic_inc(int32_t *ptr)
{
    __asm__("incl %0;\n"
            : "+m"(*ptr));
}
/*
 * Atomically exchange the val with *ptr.
 * Return the value previously stored in *ptr before the exchange
 */
int32_t atomic_xchg(uint32_t *ptr, uint32_t val)
{
    uint32_t tmp = val;
    __asm__(
        "xchgl %0, %1;\n"
        : "=r"(tmp), "+m"(*ptr)
        : "0"(tmp)
        :"memory");
    return tmp;
}
int main()
{
    int32_t tmp_counter = 0;
    printf("Init global=%d, tmp=%d\n", global_counter, tmp_counter);
    atomic_inc(&tmp_counter);
    atomic_inc(&global_counter);
    printf("After inc, global=%d, tmp=%d\n", global_counter, tmp_counter);
    tmp_counter = atomic_xchg(&global_counter, tmp_counter);
    printf("After xchg, global=%d, tmp=%d\n", global_counter, tmp_counter);
    return 0;
}
My 2 questions are:
Are these two subfunctions written properly?
Will this behave the same when I compile it on a 32-bit or a 64-bit platform? For example, could the pointer address have a different length, or could incl and xchgl conflict with the operand?
My understanding of this question is below, please correct me if I'm wrong.
All the read-modify-write instructions (e.g. incl, add, xchg) need a lock prefix to be atomic with respect to other CPUs. The lock prefix keeps other CPUs from accessing the memory during the operation, originally by asserting the LOCK# signal on the memory bus.
The __xchg function in the Linux kernel uses no explicit "lock" prefix because xchg with a memory operand always implies lock anyway. http://lxr.linux.no/linux+v2.6.38/arch/x86/include/asm/cmpxchg_64.h#L15
However, the incl used in atomic_inc has no such implicit behaviour, so a lock prefix is needed.
http://lxr.linux.no/linux+v2.6.38/arch/x86/include/asm/atomic.h#L105
By the way, I think you need to copy *ptr to a volatile variable to avoid gcc optimization.
William
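Putting those points together, a corrected version could look like the sketch below. It assumes GCC or Clang, uses int32_t consistently for both functions, and marks the asm as volatile with a "memory" clobber instead of copying through a volatile variable. On non-x86 targets it falls back to the compiler's __atomic builtins so the sketch still builds.

```cpp
#include <cstdint>

// Atomically increments *ptr.
static inline void atomic_inc(std::int32_t* ptr)
{
#if defined(__x86_64__) || defined(__i386__)
    // The lock prefix makes the read-modify-write atomic across CPUs.
    __asm__ __volatile__("lock incl %0"
                         : "+m"(*ptr)
                         :
                         : "memory", "cc");
#else
    __atomic_fetch_add(ptr, 1, __ATOMIC_SEQ_CST);
#endif
}

// Atomically stores val into *ptr and returns the previous value of *ptr.
static inline std::int32_t atomic_xchg(std::int32_t* ptr, std::int32_t val)
{
#if defined(__x86_64__) || defined(__i386__)
    // xchg with a memory operand is implicitly locked; no prefix needed.
    __asm__ __volatile__("xchgl %0, %1"
                         : "+r"(val), "+m"(*ptr)
                         :
                         : "memory");
    return val;
#else
    return __atomic_exchange_n(ptr, val, __ATOMIC_SEQ_CST);
#endif
}
```

The 32-bit operand size is fixed by the l suffix and the int32_t type, so the behaviour is the same on 32-bit and 64-bit x86; only the width of the pointer used to address the operand differs.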

What useful things can I do with Visual C++ Debug CRT allocation hooks except finding reproducible memory leaks?

The Visual C++ debug runtime library features so-called allocation hooks. It works this way: you define a callback and call _CrtSetAllocHook() to set that callback. From then on, every time a memory allocation, deallocation, or reallocation is done, the CRT calls that callback and passes a handful of parameters.
I successfully used an allocation hook to find a reproducible memory leak. Basically, the CRT reported at program termination that there was an unfreed block with allocation number N (N was the same on every program run), so I wrote the following in my hook:
int MyAllocHook( int allocType, void* userData, size_t size, int blockType,
                 long requestNumber, const unsigned char* filename, int lineNumber)
{
    if( requestNumber == TheNumberReported ) {
        Sleep( 0 );// a line to put breakpoint on
    }
    return TRUE;
}
Since the leak was reported with the very same allocation number every time, I could just put a breakpoint inside the if-statement, wait until it was hit, and then inspect the call stack.
What other useful things can I do using allocation hooks?
You could also use it to find unreproducible memory leaks:
Make a data structure where you map the allocated pointer to additional information
In the allocation hook you could query the current call stack (StackWalk function) and store the call stack in the data structure
In the de-allocation hook, remove the call stack information for that allocation
At the end of your application, loop over the data structure and report all call stacks. These are the places where memory was allocated but not freed.
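The bookkeeping part of those steps can be sketched independently of the CRT. Everything below is hypothetical (LeakTracker, AllocationInfo, and the origin string standing in for a captured call stack); it is not CRT API and shows only the data structure.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical record; a real hook would store a captured call stack here.
struct AllocationInfo {
    std::size_t size;
    std::string origin;  // stand-in for the call stack text
};

class LeakTracker {
public:
    // Call when a block is allocated.
    void on_alloc(const void* ptr, std::size_t size, const std::string& origin)
    {
        live_[ptr] = AllocationInfo{size, origin};
    }

    // Call when a block is freed: forget its record.
    void on_free(const void* ptr) { live_.erase(ptr); }

    // At program end, whatever is still recorded was never freed.
    std::vector<AllocationInfo> leaks() const
    {
        std::vector<AllocationInfo> out;
        for (const auto& entry : live_)
            out.push_back(entry.second);
        return out;
    }

private:
    std::map<const void*, AllocationInfo> live_;
};
```

Inside a real allocation hook this would need a re-entrancy guard, because the map and strings allocate memory themselves and would trigger the hook again.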
The value "requestNumber" is not passed on to the function when deallocating (MS VS 2008). Without this number you cannot keep track of your allocation. However, you can peek into the heap header and extract that value from there:
Note: This is compiler-dependent and may change between compiler versions without notice or warning.
// This struct is a copy of the heap header used by MS VS 2008.
// This information is prepended to each allocated memory object in debug mode.
struct MsVS_CrtMemBlockHeader {
    MsVS_CrtMemBlockHeader * _next;
    MsVS_CrtMemBlockHeader * _prev;
    char * _szFilename;
    int _nLine;
    int _nDataSize;
    int _nBlockUse;
    long _lRequest;
    char _gap[4];
};
int MyAllocHook(..) { // same as in question
    if(nAllocType == _HOOK_FREE) {
        // requestNumber isn't passed on to the hook on free.
        // However, this value is stored in the heap header.
        size_t headerSize = sizeof(MsVS_CrtMemBlockHeader);
        MsVS_CrtMemBlockHeader* pHead;
        size_t ptr = (size_t) pvData - headerSize;
        pHead = (MsVS_CrtMemBlockHeader*) (ptr);
        long requestNumber = pHead->_lRequest;
        // Do what you like to keep track of this allocation.
    }
}
You could keep a record of every allocation request and remove it once the matching deallocation is invoked. This could help you track down memory leak problems that are much harder to find than this one.
Just the first idea that comes to my mind...

Resources