asio implicit strand and data synchronization - thread-safety

When reading the asio source code, I was curious how asio keeps data synchronized between threads even when an implicit strand is formed. Here is the relevant code from asio:
io_service::run
mutex::scoped_lock lock(mutex_);

std::size_t n = 0;
for (; do_run_one(lock, this_thread, ec); lock.lock())
  if (n != (std::numeric_limits<std::size_t>::max)())
    ++n;
return n;
io_service::do_run_one
while (!stopped_)
{
  if (!op_queue_.empty())
  {
    // Prepare to execute first handler from queue.
    operation* o = op_queue_.front();
    op_queue_.pop();
    bool more_handlers = (!op_queue_.empty());

    if (o == &task_operation_)
    {
      task_interrupted_ = more_handlers;

      if (more_handlers && !one_thread_)
      {
        if (!wake_one_idle_thread_and_unlock(lock))
          lock.unlock();
      }
      else
        lock.unlock();

      task_cleanup on_exit = { this, &lock, &this_thread };
      (void)on_exit;

      // Run the task. May throw an exception. Only block if the operation
      // queue is empty and we're not polling, otherwise we want to return
      // as soon as possible.
      task_->run(!more_handlers, this_thread.private_op_queue);
    }
    else
    {
      std::size_t task_result = o->task_result_;

      if (more_handlers && !one_thread_)
        wake_one_thread_and_unlock(lock);
      else
        lock.unlock();

      // Ensure the count of outstanding work is decremented on block exit.
      work_cleanup on_exit = { this, &lock, &this_thread };
      (void)on_exit;

      // Complete the operation. May throw an exception. Deletes the object.
      o->complete(*this, ec, task_result);

      return 1;
    }
  }
In do_run_one, the mutex is always unlocked before the handler executes. With an implicit strand the handlers will not execute concurrently, but the problem remains: thread A runs a handler that modifies data, and thread B runs the next handler, which reads the data modified by thread A. Without the protection of the mutex, how does thread B see the changes made by thread A? Unlocking the mutex ahead of handler execution does not establish a happens-before relationship between the two threads' accesses to the data that the handlers touch.
When I dug further, I found that the handler execution uses something called fenced_block:
completion_handler* h(static_cast<completion_handler*>(base));
ptr p = { boost::addressof(h->handler_), h, h };

BOOST_ASIO_HANDLER_COMPLETION((h));

// Make a copy of the handler so that the memory can be deallocated before
// the upcall is made. Even if we're not about to make an upcall, a
// sub-object of the handler may be the true owner of the memory associated
// with the handler. Consequently, a local copy of the handler is required
// to ensure that any owning sub-object remains valid until after we have
// deallocated the memory here.
Handler handler(BOOST_ASIO_MOVE_CAST(Handler)(h->handler_));
p.h = boost::addressof(handler);
p.reset();

// Make the upcall if required.
if (owner)
{
  fenced_block b(fenced_block::half);
  BOOST_ASIO_HANDLER_INVOCATION_BEGIN(());
  boost_asio_handler_invoke_helpers::invoke(handler, handler);
  BOOST_ASIO_HANDLER_INVOCATION_END;
}
What is this? I know a fence is a synchronization primitive supported by C++11, but this fence is written entirely by asio itself. Does this fenced_block do the job of data synchronization?
UPDATED
After googling and reading this and this, asio indeed uses memory fence primitives to synchronize data between threads, which is faster than holding the lock until the handler finishes executing (see the speed difference on x86). In fact, Java's volatile keyword is implemented by inserting a memory barrier after each write to and before each read of the variable, which is what establishes the happens-before relationship.
If someone could briefly describe asio's memory fence implementation, or add something I missed or misunderstood, I will accept it.

Before the operation invokes the user handler, Boost.Asio uses a memory fence to provide the appropriate memory reordering without forcing mutually exclusive execution of the handlers. Thus, thread B will observe changes to memory that occurred within the context of thread A.
C++03 did not specify requirements for memory visibility with regard to multi-threaded execution. However, C++11 defines these requirements in § 1.10 Multi-threaded executions and data races, as well as in the Atomic operations and Thread support library sections. Boost and C++11 mutexes do perform the appropriate memory reordering. For other mutex implementations, it is worth checking the library's documentation to verify that memory reordering occurs.
Boost.Asio memory fences are an implementation detail, and thus always subject to change. Boost.Asio abstracts itself from the architecture/compiler specific implementations through a series of conditional defines within asio/detail/fenced_block.hpp where only a single memory barrier implementation is included. The underlying implementation is contained within a class, for which a fenced_block alias is created via a typedef.
Here is a relevant excerpt:
#elif defined(__GNUC__) && (defined(__hppa) || defined(__hppa__))
# include "asio/detail/gcc_hppa_fenced_block.hpp"
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
# include "asio/detail/gcc_x86_fenced_block.hpp"
#elif ...
...
namespace asio {
namespace detail {
...
#elif defined(__GNUC__) && (defined(__hppa) || defined(__hppa__))
typedef gcc_hppa_fenced_block fenced_block;
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
typedef gcc_x86_fenced_block fenced_block;
#elif ...
...
} // namespace detail
} // namespace asio
The implementations of the memory barriers are specific to the architecture and compiler. Boost.Asio has a family of asio/detail/*_fenced_block.hpp header files. For example, win_fenced_block uses InterlockedExchange for Borland; otherwise it uses the xchg assembly instruction, which has an implicit lock prefix when used with a memory address. For gcc_x86_fenced_block, Boost.Asio uses inline assembly with a "memory" clobber.
If you find yourself needing to use a fence, then consider the Boost.Atomic library. Introduced in Boost 1.53, Boost.Atomic provides an implementation of thread and signal fences based on the C++11 standard. Boost.Asio had been using its own implementation of memory fences since before Boost.Atomic was added to Boost. Also, the Boost.Asio fences are scope-based: fenced_block performs an acquire in its constructor and a release in its destructor.
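For illustration, a scope-based fence in the spirit of fenced_block can be sketched with C++11's std::atomic_thread_fence (this is expository only; asio's real implementations are the architecture-specific headers shown above, written before C++11 was available):

#include <atomic>

// Expository analogue of fenced_block: acquire semantics on entry,
// release semantics on exit of the scope.
class scoped_fence
{
public:
    scoped_fence()  { std::atomic_thread_fence(std::memory_order_acquire); }
    ~scoped_fence() { std::atomic_thread_fence(std::memory_order_release); }
};

void run_handler_upcall(/* handler */)
{
    scoped_fence fence; // acquire fence: pick up writes published by the
                        // thread that ran the previous handler
    // ... invoke the user's handler here ...
}   // release fence: publish this handler's writes for whichever thread runs next

Combined with the synchronized queue operations, the release at the end of one handler invocation pairs with the acquire at the start of the next, which is what gives consecutive handlers in an implicit strand the happens-before relationship asked about in the question.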

Related

update integer array elements atomically C++

Given a shared array of integer counters, I am interested to know if a thread can atomically fetch and add an array element without locking the entire array?
Here's an illustration of a working model that uses a mutex to lock access to the entire array.
// thread-shared class members
std::mutex count_array_mutex_;
std::vector<int> counter_array_ = std::vector<int>(100); // "100ish" elements

// Thread critical section
int counter_index = ... // unpredictable index
int current_count;
{
    std::lock_guard<std::mutex> lock(count_array_mutex_);
    current_count = counter_array_[counter_index]++;
}
// ... do stuff using current_count.
I'd like multiple threads to be able to fetch-add separate array elements simultaneously.
So far, in my research of std::atomic<int>, I'm thrown off that constructing the atomic object also constructs the protected member. (And there are plenty of answers explaining why you can't make a std::vector<std::atomic<int>>.)
C++20 / C++2a (or whatever you want to call it) will add std::atomic_ref<T> which lets you do atomic operations on an object that wasn't atomic<T> to start with.
It's not available yet as part of the standard library for most compilers, but there is a working implementation for gcc/clang/ICC / other compilers with GNU extensions.
Previously, atomic access to "plain" data was only available with some platform-specific functions like Microsoft's LONG InterlockedExchange(LONG volatile *Target, LONG Value); or GNU C / C++
type __atomic_add_fetch (type *ptr, type val, int memorder) (the same builtins that C++ libraries for GNU compilers use to implement std::atomic<T>.)
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0019r8.html includes some intro stuff about the motivation. CPUs can easily do this, compilers can already do this, and it's been annoying that C++ didn't expose this capability portably.
So instead of having to wrestle with C++ to get all the non-atomic allocation and init done in a constructor, you can just have every access create an atomic_ref to the element you want to access. (It's free to instantiate as a local, at least when it's lock-free, on any "normal" C++ implementations).
This will even let you do things like resize the std::vector<int> after you've ensured no other threads are accessing the vector elements or the vector control block itself. And then you can signal the other threads to resume.
It's not yet implemented in libstdc++ or libc++ for gcc/clang.
#include <vector>
#include <atomic>
#define Foo std // this atomic_ref.hpp puts it in namespace Foo, not std.
// current raw url for https://github.com/ORNL/cpp-proposals-pub/blob/master/P0019/atomic_ref.hpp
#include "https://raw.githubusercontent.com/ORNL/cpp-proposals-pub/580934e3b8cf886e09accedbb25e8be2d83304ae/P0019/atomic_ref.hpp"
void inc_element(std::vector<int> &v, size_t idx)
{
    v[idx]++;
}

void atomic_inc_element(std::vector<int> &v, size_t idx)
{
    std::atomic_ref<int> elem(v[idx]);
    static_assert(decltype(elem)::is_always_lock_free,
                  "performance is going to suck without lock-free atomic_ref<T>");

    elem.fetch_add(1, std::memory_order_relaxed); // take your pick of memory order here
}
For x86-64, these compile exactly the way we'd hope with GCC,
using the sample implementation (for compilers implementing GNU extensions) linked in the C++ working-group proposal. https://github.com/ORNL/cpp-proposals-pub/blob/master/P0019/atomic_ref.hpp
From the Godbolt compiler explorer with g++8.2 -Wall -O3 -std=gnu++2a:
inc_element(std::vector<int, std::allocator<int> >&, unsigned long):
    mov     rax, QWORD PTR [rdi]       # load the pointer member of std::vector
    add     DWORD PTR [rax+rsi*4], 1   # and index it as a memory destination
    ret

atomic_inc_element(std::vector<int, std::allocator<int> >&, unsigned long):
    mov     rax, QWORD PTR [rdi]
    lock add DWORD PTR [rax+rsi*4], 1  # same but atomic RMW
    ret
The atomic version is identical except it uses a lock prefix to make the read-modify-write atomic, by making sure no other core can read or write the cache line while this core is in the middle of atomically modifying it. Just in case you were curious how atomics work in asm.
Most non-x86 ISAs like AArch64 of course require a LL/SC retry loop to implement an atomic RMW, even with relaxed memory order.
The point here is that constructing / destructing the atomic_ref doesn't cost anything. Its member pointer fully optimizes away. So this is exactly as cheap as a vector<atomic<int>>, but without the headache.
As long as you're careful not to create data-race UB by resizing the vector, or accessing an element without going through atomic_ref. (It would potentially manifest as a use-after-free on many real implementations if std::vector reallocated the memory in parallel with another thread indexing into it, and of course you'd be atomically modifying a stale copy.)
This definitely gives you rope to hang yourself if you don't carefully respect the fact that the std::vector object itself is not atomic, and also that the compiler won't stop you from doing non-atomic access to the underlying v[idx] after other threads have started using it.
One way:
// Create.
std::vector<std::atomic<int>> v(100);

// Initialize.
for(auto& e : v)
    e.store(0, std::memory_order_relaxed);

// Atomically increment.
auto unpredictable_index = std::rand() % v.size();
int old = v[unpredictable_index].fetch_add(1, std::memory_order_relaxed);
Note that the std::atomic<> copy constructor is deleted, so the vector cannot be resized and needs to be initialized with the final count of elements.
Since the resize functionality of std::vector is lost anyway, you may as well use std::unique_ptr<std::atomic<int>[]> instead of std::vector, e.g.:
// Create.
unsigned const N = 100;
std::unique_ptr<std::atomic<int>[]> p(new std::atomic<int>[N]);

// Initialize.
for(unsigned i = 0; i < N; ++i)
    p[i].store(0, std::memory_order_relaxed);

// Atomically increment.
auto unpredictable_index = std::rand() % N;
int old = p[unpredictable_index].fetch_add(1, std::memory_order_relaxed);

How can I tell if a static object has been destroyed in C++11

In the C++11 specification, basic.start.term 1 states:

If the completion of the constructor or dynamic initialization of an object with static storage duration is sequenced before that of another, the completion of the destructor of the second is sequenced before the initiation of the destructor of the first. [ Note: This definition permits concurrent destruction. —end note ]
In C++03, my destructors were ordered. The order may not be specified, but they were ordered. This was very useful for static objects that had to register themselves. There was no concept of multithreading in the spec, so the spec had no concept of unordered destructors. The compilers that implemented multithreading that I know of did destruction in a single-threaded environment.
a.cpp:
struct A
{
A()
: mRegistration(0)
{ }
~A()
{
if (mRegistration)
tryUnregisterObject(mRegistration);
}
void registerNow()
{
mRegistration = registerObject(this);
}
};
A myA;
b.cpp:
class Registrar
{
public:
    Registrar()
    {
        isAlive = true;
    }

    ~Registrar()
    {
        isAlive = false;
    }
    ...
};

bool isAlive = false; // constant initialization

static Registrar& registrar()
{
    static Registrar instance;
    return instance;
}

int registerObject(void* obj)
{
    return registrar().register(obj);
}

void tryUnregisterObject(void* obj)
{
    if (isAlive) {
        registrar().unregister(obj);
    } else {
        // do nothing. registrar was destroyed
    }
}
In this example, I can't guarantee the order of destruction for myA and Registrar because they're in different compilation units. However, I can at least detect what order they occurred in and act accordingly.
In C++11, this approach creates a data race around the isAlive variable. This can be solved during construction because I can create a synchronization object like a mutex to protect it when I first need it. However, in the destruction case, I may have to check isAlive after my mutex has been destroyed!
Is there a way to get around this in C++11? I feel like I need a synchronization primitive to solve the problem, but everything I've tried leads to the primitive getting destroyed before it's done protecting what I need to protect. If I were to use the Windows or PThreads threading primitives, I could simply elect not to call the destructor and let the OS clean up after me. However, C++ objects clean themselves up.
[basic.start.init]/2 If a program starts a thread (30.3), the subsequent initialization of a variable is unsequenced with respect to the initialization of a variable defined in a different translation unit. Otherwise, the initialization of a variable is indeterminately sequenced with respect to the initialization of a variable defined in a different translation unit. If a program starts a thread, the subsequent unordered initialization of a variable is unsequenced with respect to every other dynamic initialization. Otherwise, the unordered initialization of a variable is indeterminately sequenced with respect to every other dynamic initialization.
(Emphasis mine.) Therefore, as long as you don't start threads in any of your static objects' constructors, you have the same guarantee as in earlier versions of the standard.
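To see how that guarantee can be put to work with the question's types, consider moving the registration into A's constructor (a sketch building on the question's code, assuming registerObject is safe to call during static initialization): the first call of registrar() then completes the Registrar's construction before myA's construction completes, so by the quoted rule ~A() finishes before ~Registrar() begins, and the isAlive flag becomes unnecessary.

struct A
{
    A()
      : mRegistration(0)
    {
        // First use of registrar() (inside registerObject) constructs the
        // Registrar before this constructor completes, so destruction is
        // guaranteed to happen in the reverse, safe order.
        mRegistration = registerObject(this);
    }

    ~A()
    {
        if (mRegistration)
            tryUnregisterObject(mRegistration); // Registrar is still alive here
    }

    int mRegistration;
};

A myA;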

Handling data on wire using unique_ptr

When receiving data on the wire and passing it up to applications, normally, in C style, we have a struct, for example with a void*:
struct SData {
    // ... len, size, version, msg type, ...
    void* payload;
};
Later in the code, after error checking, malloc'ing, and so on, we can do something like:

if (msgType == type1) {
    struct SType1* ptr = (struct SType1*) sdata->payload;
}
In C++, an attempt to use unique_ptr fails in the following snippet:
struct SData {
    // ... len, size, version, msg type, ...
    std::unique_ptr<void> payload;
};
But as you know, this will cause:
error: static assertion failed: can't delete pointer to incomplete type
Is there a way to use smart pointers to handle this?
One solution I found is here:
Should std::unique_ptr<void> be permitted
Which requires creating a custom deleter:
void HandleDeleter(HANDLE h)
{
    if (h) CloseHandle(h);
}

using UniHandle = unique_ptr<void, function<void(HANDLE)>>;
This will require significantly more additional code (compared to the simple unsafe C Style), since for each type of payload there has to be some logic added.
The additional complexity is only calling the added destructors. You could use a function pointer instead of std::function since no closure state should ever be used.
If you don't want destructors, but only to add RAII to the C idiom, then use a custom deleter which simply does operator delete or std::free.
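For example, a minimal sketch of that approach (RawPayload, RawDeleter and the surrounding names are illustrative, not from any library):

#include <cstdlib>
#include <memory>

// Deleter that only frees the raw storage, mirroring the C idiom:
// no destructor runs for whatever the payload actually contains.
inline void RawDeleter(void* p)
{
    std::free(p);
}

// Function pointer instead of std::function: no closure state is needed.
using RawPayload = std::unique_ptr<void, void (*)(void*)>;

struct SData {
    // ... len, size, version, msg type, ...
    RawPayload payload{nullptr, &RawDeleter};
};

// Usage: adopt a malloc'ed buffer received off the wire; it is freed
// automatically when the SData goes out of scope.
//   data.payload = RawPayload(std::malloc(len), &RawDeleter);
//   SType1* ptr = static_cast<SType1*>(data.payload.get());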

Almost lockless producer consumer

I have a producer-consumer problem to solve, with a slight modification: there are many parallel producers but only one consumer, running in a single thread. When a producer finds no place in the buffer it simply drops the element (without waiting for the consumer). I have written some C pseudocode:
struct Element
{
    ULONG content;
    volatile LONG bNew;
};

ULONG max_count = 10;
Element* buffer = calloc(max_count, sizeof(Element));
volatile LONG producer_idx = 0;
LONG consumer_idx = 0;
EVENT NotEmpty;

BOOLEAN produce(ULONG content)
{
    LONG idx = InterlockedIncrement(&producer_idx) % max_count;
    if (buffer[idx].bNew)
        return FALSE;

    buffer[idx].content = content;
    buffer[idx].bNew = TRUE;
    SetEvent(NotEmpty);
    return TRUE;
}

void consume_thread()
{
    while (TRUE)
    {
        Wait(NotEmpty);
        while (buffer[consumer_idx].bNew)
        {
            ULONG content = buffer[consumer_idx].content;
            InterlockedExchange(&buffer[consumer_idx].bNew, FALSE);

            // Simple mechanism for preventing producer_idx overflow
            LONG tmp = producer_idx;
            InterlockedCompareExchange(&producer_idx, tmp % max_count, tmp);

            consumer_idx = (consumer_idx + 1) % max_count;
            doSth(content);
        }
    }
}
I am not 100% sure that this code is correct. Can you see any problems that could occur? Or maybe this code could be written in a better way?
Don't use a global variable to accomplish your goal, especially in a multithreaded application!!! Use a semaphore instead, and don't do Lock but TryLock instead. If TryLock fails, it means there's no room for another element, so you can skip it.
Here you can find something to read about semaphores in WinAPI, since you would probably use them:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686946(v=vs.85).aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/ms687032(v=vs.85).aspx
You can achieve TryLock functionality by passing 0 as the timeout to the WaitForSingleObject function.
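For example, a minimal sketch of the suggested try-acquire pattern (slots is an illustrative name; buffer handling is elided):

#include <windows.h>

HANDLE slots; // counting semaphore tracking free buffer slots

void init(void)
{
    // Initially all 10 slots are free (matching the question's max_count).
    slots = CreateSemaphore(NULL, 10, 10, NULL);
}

BOOLEAN produce(ULONG content)
{
    // TryLock: a timeout of 0 returns immediately instead of blocking.
    if (WaitForSingleObject(slots, 0) != WAIT_OBJECT_0)
        return FALSE; // no room - drop the element, as the question requires

    // ... write content into the buffer and signal the consumer ...
    return TRUE;
}

// The consumer calls ReleaseSemaphore(slots, 1, NULL) after emptying a slot.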
Please, read this: http://en.wikipedia.org/wiki/Memory_barrier
The C and C++ standards do not address multiple threads (or multiple processors), and as such, the usefulness of volatile depends on the compiler and hardware. Although volatile guarantees that the volatile reads and volatile writes will happen in the exact order specified in the source code, the compiler may generate code (or the CPU may re-order execution) such that a volatile read or write is reordered with regard to non-volatile reads or writes, thus limiting its usefulness as an inter-thread flag or mutex. Moreover, it is not guaranteed that volatile reads and writes will be seen in the same order by other processors due to caching, cache coherence protocol and relaxed memory ordering, meaning volatile variables alone may not even work as inter-thread flags or mutexes.
So in the common case, volatile alone won't work for C.
But it could work for some specific compilers/hardware and other languages (Java 5, for example).
See also Is function call a memory barrier?
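If C++11 is an option, the question's bNew flag can be expressed with std::atomic instead of volatile, which does provide the ordering guarantees discussed above (a sketch; publish and try_consume are illustrative names, not part of the question's code):

#include <atomic>

struct Element
{
    unsigned long content;
    std::atomic<long> bNew;
};

// Producer side: write the payload first, then set the flag with release
// ordering so a consumer's acquire load is guaranteed to see the payload.
void publish(Element& e, unsigned long value)
{
    e.content = value;
    e.bNew.store(1, std::memory_order_release);
}

// Consumer side: the acquire load pairs with the producer's release store.
bool try_consume(Element& e, unsigned long& out)
{
    if (!e.bNew.load(std::memory_order_acquire))
        return false;
    out = e.content;
    e.bNew.store(0, std::memory_order_release); // hand the slot back
    return true;
}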

What useful things can I do with Visual C++ Debug CRT allocation hooks except finding reproduceable memory leaks?

The Visual C++ debug runtime library features so-called allocation hooks. They work this way: you define a callback and call _CrtSetAllocHook() to set that callback. Now every time a memory allocation/deallocation/reallocation is done, the CRT calls that callback and passes a handful of parameters.
I successfully used an allocation hook to find a reproducible memory leak - basically the CRT reported that there was an unfreed block with allocation number N (N was the same on every program run) at program termination, so I wrote the following in my hook:
int MyAllocHook(int allocType, void* userData, size_t size, int blockType,
                long requestNumber, const unsigned char* filename, int lineNumber)
{
    if (requestNumber == TheNumberReported) {
        Sleep(0); // a line to put a breakpoint on
    }
    return TRUE;
}
Since the leak was reported with the very same allocation number every time, I could just put a breakpoint inside the if statement, wait until it was hit, and then inspect the call stack.
What other useful things can I do using allocation hooks?
You could also use it to find unreproducible memory leaks:
Make a data structure where you map the allocated pointer to additional information
In the allocation hook you could query the current call stack (StackWalk function) and store the call stack in the data structure
In the de-allocation hook, remove the call stack information for that allocation
At the end of your application, loop over the data structure and report all call stacks. These are the places where memory was allocated but not freed.
The value "requestNumber" is not passed on to the function when deallocating (MS VS 2008). Without this number you cannot keep track of your allocation. However, you can peek into the heap header and extract that value from there:
Note: This is compiler-dependent and may change without notice/warning by the compiler.

// This struct is a copy of the heap header used by MS VS 2008.
// This information prepends each allocated memory object in debug mode.
struct MsVS_CrtMemBlockHeader {
    MsVS_CrtMemBlockHeader* _next;
    MsVS_CrtMemBlockHeader* _prev;
    char* _szFilename;
    int   _nLine;
    int   _nDataSize;
    int   _nBlockUse;
    long  _lRequest;
    char  _gap[4];
};
int MyAllocHook(..) { // same signature as in the question
    if (allocType == _HOOK_FREE) {
        // requestNumber isn't passed on to the hook on free.
        // However, the value is stored in the heap header.
        size_t headerSize = sizeof(MsVS_CrtMemBlockHeader);
        size_t ptr = (size_t)userData - headerSize;
        MsVS_CrtMemBlockHeader* pHead = (MsVS_CrtMemBlockHeader*)ptr;
        long requestNumber = pHead->_lRequest;
        // Do what you like to keep track of this allocation.
    }
    return TRUE;
}
You could keep a record of every allocation request and remove it once the deallocation is invoked - see the sketch below. This could help you track down memory leak problems that are far worse than this one.
Just the first idea that comes to my mind...
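For example, a sketch of that record-keeping, combining the allocation hook with the heap-header trick from the answer above (compiler-dependent, and not thread-safe as written; g_live is illustrative bookkeeping):

#include <crtdbg.h>
#include <map>

// requestNumber -> size of each allocation not yet freed. A call stack
// captured via StackWalk could be stored here instead of the size.
static std::map<long, size_t> g_live;
static bool g_inHook = false; // the map's own allocations re-enter the hook

int TrackingAllocHook(int allocType, void* userData, size_t size, int blockType,
                      long requestNumber, const unsigned char* filename, int lineNumber)
{
    if (g_inHook || blockType == _CRT_BLOCK) // ignore the CRT's internal blocks
        return TRUE;
    g_inHook = true;

    if (allocType == _HOOK_ALLOC) {
        g_live[requestNumber] = size;
    } else if (allocType == _HOOK_FREE && userData) {
        // requestNumber isn't passed on free; recover it from the
        // (compiler-specific) debug heap header, as shown above.
        MsVS_CrtMemBlockHeader* pHead = (MsVS_CrtMemBlockHeader*)
            ((char*)userData - sizeof(MsVS_CrtMemBlockHeader));
        g_live.erase(pHead->_lRequest);
    }

    g_inHook = false;
    return TRUE;
}

// Install with _CrtSetAllocHook(TrackingAllocHook); anything still in
// g_live at program exit was allocated but never freed.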
