I'm modifying an existing open-source library, and there is a struct (say, named Node) containing bit-fields, e.g.
struct Node {
std::atomic<uint32_t> size:30;
std::atomic<uint32_t> isnull:1;
};
To fit my needs, these fields have to be atomic, so I expected to use std::atomic for this, but I hit a compile-time error:
bit-field 'size' has non-integral type 'std::atomic<uint32_t>'
According to the documentation, there is a restricted set of types that can be used with std::atomic.
Can anyone advise on how to get the functionality of atomic bit-fields with minimal impact on the existing source code?
Thanks in advance!
I used an unsigned short as an example below.
This is less than ideal, but you could sacrifice 8 bits and insert a std::atomic_flag into the bit-field with a union. Unfortunately, std::atomic_flag is essentially an atomic boolean, so it occupies a whole byte.
This structure can be spin-locked manually every time you access it. However, the code should suffer minimal performance degradation (unlike creating, locking, unlocking, and destroying with a std::mutex and std::unique_lock).
This code may waste about 10-30 clock cycles per access, a low cost for multi-threading.
PS. Make sure the reserved 8 bits below are not messed up by the endianness of the processor. You may have to place them at the end for big-endian processors; see the sanity check after the example's output below. I only tested this code on an Intel CPU (always little-endian).
#include <iostream>
#include <atomic>
#include <thread>
union Data
{
    std::atomic_flag access = ATOMIC_FLAG_INIT; // one byte
    struct
    {
        typedef unsigned short ushort;
        ushort reserved : 8; // overlaps the atomic_flag's byte (little-endian)
        ushort count : 4;
        ushort ready : 1;
        ushort unused : 3;
    } bits;
};
class SpinLock
{
public:
    inline SpinLock(std::atomic_flag &access, bool locked=true)
        : mAccess(access)
    {
        if (locked) lock();
    }
    inline ~SpinLock()
    {
        if (mOwned) unlock(); // only release the lock if we still hold it
    }
    inline void lock()
    {
        while (mAccess.test_and_set(std::memory_order_acquire))
        {
        }
        mOwned = true;
    }
    // each attempt will take about 10-30 clock cycles
    inline bool try_lock(unsigned int attempts=0)
    {
        while (mAccess.test_and_set(std::memory_order_acquire))
        {
            if (!attempts) return false;
            --attempts;
        }
        mOwned = true;
        return true;
    }
    inline void unlock()
    {
        mOwned = false;
        mAccess.clear(std::memory_order_release);
    }
private:
    std::atomic_flag &mAccess;
    bool mOwned = false; // guards against clearing a flag we do not hold
};
void aFn(int &i, Data &d)
{
    SpinLock lock(d.access, false);
    // manually locking/unlocking can be tighter
    lock.lock();
    if (d.bits.ready)
    {
        ++d.bits.count;
    }
    d.bits.ready ^= true; // alternate each time
    lock.unlock();
}
int main(void)
{
    Data f;
    f.bits.count = 0;
    f.bits.ready = true;
    std::thread *p[8];
    for (int i = 0; i < 8; ++i)
    {
        p[i] = new std::thread([&f](int i) { aFn(i, f); }, i);
    }
    for (int i = 0; i < 8; ++i)
    {
        p[i]->join();
        delete p[i];
    }
    std::cout << "size: " << sizeof(f) << std::endl;
    std::cout << "count: " << f.bits.count << std::endl;
}
The result is as expected...
size: 2
count: 4
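If you want to verify that overlap on your target, here is a small sanity check of my own (not part of the original answer): it writes through the reserved bit-field and inspects the union's first byte, i.e. the byte the std::atomic_flag occupies. If it returns false on your platform, move the reserved field to the other end of the bit-field.
#include <cstring>

// Assumes the Data union from above; call this before starting any threads.
inline bool reserved_overlaps_flag()
{
    Data d;
    d.bits = {};                     // make 'bits' the active member, zeroed
    d.bits.reserved = 0xFF;          // write through the bit-field...
    unsigned char first_byte;
    std::memcpy(&first_byte, &d, 1); // ...then inspect the union's first byte
    return first_byte == 0xFF;
}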
Related
I have implemented a custom storage interface in libtorrent as described in the help section here.
The storage_interface is working fine, although I can't figure out why readv is only called seemingly at random while downloading a torrent. As I understand it, the overridden virtual function readv should get called each time I call handle->read_piece() in piece_finished_alert, in order to read the piece for read_piece_alert.
The buffer is provided in read_piece_alert without readv being notified.
So the question is: why is readv called only randomly, and why is it not called on a read_piece() call? Is my storage_interface perhaps wrong?
The code looks like this:
struct temp_storage : storage_interface
{
    virtual int readv(file::iovec_t const* bufs, int num_bufs
        , int piece, int offset, int flags, storage_error& ec)
    {
        // Only called on random pieces while downloading a larger torrent
        std::map<int, std::vector<char> >::const_iterator i = m_file_data.find(piece);
        if (i == m_file_data.end()) return 0;
        int available = i->second.size() - offset;
        if (available <= 0) return 0;
        if (available > num_bufs) available = num_bufs;
        memcpy(&bufs, &i->second[offset], available);
        return available;
    }
    virtual int writev(file::iovec_t const* bufs, int num_bufs
        , int piece, int offset, int flags, storage_error& ec)
    {
        std::vector<char>& data = m_file_data[piece];
        if (data.size() < offset + num_bufs) data.resize(offset + num_bufs);
        std::memcpy(&data[offset], bufs, num_bufs);
        return num_bufs;
    }
    virtual bool has_any_file(storage_error& ec) { return false; }
    virtual ...
    virtual ...
}
Initialized with
storage_interface* temp_storage_constructor(storage_params const& params)
{
    printf("NEW INTERFACE\n");
    return new temp_storage(*params.files);
}

p.storage = &temp_storage_constructor;
The function below sets up alerts and invokes read_piece on each completed piece.
while (true) {
    std::vector<alert*> alerts;
    s.pop_alerts(&alerts);
    for (alert* i : alerts)
    {
        switch (i->type()) {
            case read_piece_alert::alert_type:
            {
                read_piece_alert* p = (read_piece_alert*)i;
                if (p->ec) {
                    // read_piece failed
                    break;
                }
                // piece buffer, size is provided without readv
                // notification after invoking read_piece in piece_finished_alert
                break;
            }
            case piece_finished_alert::alert_type: {
                piece_finished_alert* p = (piece_finished_alert*)i;
                p->handle.read_piece(p->piece_index);
                // Once the piece is finished, we read it to obtain the buffer in read_piece_alert.
                break;
            }
            default:
                break;
        }
    }
    Sleep(100);
}
I will answer my own question. As Arvid said in the comments, readv was not invoked because of caching. Setting settings_pack::use_read_cache to false causes readv to be invoked every time.
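For reference, a minimal sketch of applying that setting (assuming an existing libtorrent session object named ses; the name is illustrative):
#include <libtorrent/session.hpp>
#include <libtorrent/settings_pack.hpp>

// Disable the read cache so read_piece() always reaches the custom storage's readv().
libtorrent::settings_pack pack;
pack.set_bool(libtorrent::settings_pack::use_read_cache, false);
ses.apply_settings(pack);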
In an attempt to make a more usable version of the code I wrote for an answer to another question, I used a lambda function to process an individual unit. This is a work in progress. I've got the "client" syntax looking pretty nice:
// for loop split into 4 threads, calling doThing for each index
parloop(4, 0, 100000000, [](int i) { doThing(i); });
However, I have an issue. Whenever I call the saved lambda, it takes up a ton of CPU time. doThing itself is an empty stub. If I just comment out the internal call to the lambda, the speed returns to normal (a 4x speedup for 4 threads). I'm using std::function to save the reference to the lambda.
My question is: is there some better way the standard library manages lambdas internally for large sets of data that I haven't come across?
struct parloop
{
public:
    std::vector<std::thread> myThreads;
    int numThreads, rangeStart, rangeEnd;
    std::function<void(int)> lambda;

    parloop(int _numThreads, int _rangeStart, int _rangeEnd, std::function<void(int)> _lambda)
        : numThreads(_numThreads), rangeStart(_rangeStart), rangeEnd(_rangeEnd), lambda(_lambda)
    {
        init();
        exit();
    }
    void init()
    {
        myThreads.resize(numThreads);
        for (int i = 0; i < numThreads; ++i)
        {
            myThreads[i] = std::thread(myThreadFunction, this, chunkStart(i), chunkEnd(i));
        }
    }
    void exit()
    {
        for (int i = 0; i < numThreads; ++i)
        {
            myThreads[i].join();
        }
    }
    int rangeJump()
    {
        return ceil(float(rangeEnd - rangeStart) / float(numThreads));
    }
    int chunkStart(int i)
    {
        return rangeJump() * i;
    }
    int chunkEnd(int i)
    {
        return std::min(rangeJump() * (i + 1) - 1, rangeEnd);
    }
    static void myThreadFunction(parloop *self, int start, int end)
    {
        std::function<void(int)> lambda = self->lambda;
        // we're just going to loop through the numbers and print them out
        for (int i = start; i <= end; ++i)
        {
            lambda(i); // commenting this out speeds things up back to normal
        }
    }
};
void doThing(int i) // "payload" of the lambda function
{
}
int main()
{
    auto start = timer.now();
    auto stop = timer.now();
    // run 4 trials of each number of threads
    for (int x = 1; x <= 4; ++x)
    {
        // test between 1-8 threads
        for (int numThreads = 1; numThreads <= 8; ++numThreads)
        {
            start = timer.now();
            // this is the line of code which calls doThing in the loop
            parloop(numThreads, 0, 100000000, [](int i) { doThing(i); });
            stop = timer.now();
            cout << numThreads << " Time = " << std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count() / 1000000.0f << " ms\n";
            //cout << "\t\tsimple list, time was " << deltaTime2 / 1000000.0f << " ms\n";
        }
    }
    cin.ignore();
    cin.get();
    return 0;
}
I'm using std::function to save the reference to the lambda.
That's one possible problem: std::function is not a zero-runtime-cost abstraction. It is a type-erased wrapper that has a virtual-call-like cost when invoking operator(), and it can also heap-allocate (which can mean a cache miss per call).
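As a quick illustration (my own sketch, not part of the question's benchmark), the callable's concrete type is erased behind std::function, so every call goes through an indirect dispatch that the optimizer usually cannot inline:
#include <functional>

int twice(int x) { return 2 * x; }

int main()
{
    // The concrete type of the callable is erased; larger callables may also be
    // heap-allocated (the small-buffer optimization is implementation-defined).
    std::function<int(int)> f = twice;
    return f(21); // indirect, virtual-call-like dispatch
}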
If you want to store your lambda in such a way that does not introduce additional overhead and that allows the compiler to inline it, you should use a template parameter. This is not always possible, but might fit your use case. Example:
template <typename TFunction>
struct parloop
{
public:
    std::thread **myThreads;
    int numThreads, rangeStart, rangeEnd;
    TFunction lambda;

    parloop(TFunction&& _lambda,
            int _numThreads, int _rangeStart, int _rangeEnd)
        : lambda(std::move(_lambda)),
          numThreads(_numThreads), rangeStart(_rangeStart),
          rangeEnd(_rangeEnd)
    {
        init();
        exit();
    }
// ...
To deduce the type of the lambda, you can use a helper function:
template <typename TF, typename... TArgs>
auto make_parloop(TF&& lambda, TArgs&&... xs)
{
    return parloop<std::decay_t<TF>>(
        std::forward<TF>(lambda), std::forward<TArgs>(xs)...);
}
Usage:
auto p = make_parloop([](int i) { doThing(i); },
                      numThreads, 0, 100000000);
I wrote an article that's related to the subject:
"Passing functions to functions"
It contains some benchmarks that show how much assembly is generated for std::function compared to a template parameter and other solutions.
I'm writing a simple bitset wrapper to easily and efficiently set, clear, and read bits from an 8-bit integer. I would like to do these three operations via operator[], but I'm stuck, and honestly I'm not sure it's possible without losing performance (really important for my purposes).
#include <stdint.h>
#include <iostream>

class BitMask {
private:
    uint8_t mask;
public:
    inline void set(int bit) { mask |= 1 << bit; }          // deprecated
    inline void clear(int bit) { mask &= ~(1 << bit); }     // deprecated
    inline int read(int bit) { return (mask >> bit) & 1; }  // deprecated

    bool operator[](int bit) const { return (mask >> bit) & 1; } // new way to read
    ??? // new way to write

    friend std::ostream& operator<<(std::ostream& os, const BitMask& bitmask) {
        for (int bit = 0; bit < 8; ++bit)
            os << ((bitmask.mask >> bit) & 1 ? "1" : "0");
        return os;
    }
};

int main() {
    BitMask bitmask1;
    bitmask1.set(3);
    bitmask1.clear(3);
    bitmask1.read(3);
    std::cout << bitmask1;

    BitMask bitmask2;
    bitmask2[3] = 1; // set
    bitmask2[3] = 0; // clear
    bitmask2[3];     // read
    std::cout << bitmask2;
}
Any idea?
One way (the only way?) is to return a proxy object from your operator[] that holds the index of your bit; you then assign the new value to the proxy object, which alters the appropriate BitMask bit. For an example, see here: Vector, proxy class and dot operator in C++
As for performance, it all depends on how the compiler optimizes your code; if your proxy class has only inline methods, it should be fast.
Below is an example of how to fix your code:
#include <stdint.h>
#include <iostream>

class BitMask {
private:
    uint8_t mask = 0; // initialized so reads are well-defined and a const BitMask can be default-constructed
public:
    inline void set(int bit) { mask |= 1 << bit; }                // deprecated
    inline void clear(int bit) { mask &= ~(1 << bit); }           // deprecated
    inline int read(int bit) const { return (mask >> bit) & 1; }  // deprecated

    struct proxy_bit
    {
        BitMask& bitmask;
        int index;
        proxy_bit(BitMask& p_bitmask, int p_index) : bitmask(p_bitmask), index(p_index) {}
        proxy_bit& operator=(int rhs) {
            if (rhs)
                bitmask.set(index);
            else
                bitmask.clear(index);
            return *this;
        }
        operator int() const {
            return bitmask.read(index);
        }
    };

    proxy_bit operator[](int bit) { return proxy_bit(*this, bit); } // new way to write (and read)
    int operator[](int bit) const { return read(bit); }             // new way to read

    friend std::ostream& operator<<(std::ostream& os, const BitMask& bitmask) {
        for (int bit = 0; bit < 8; ++bit)
            os << ((bitmask.mask >> bit) & 1 ? "1" : "0");
        return os;
    }
};

int main() {
    BitMask bitmask1;
    bitmask1.set(3);
    bitmask1.clear(3);
    bitmask1.read(3);
    std::cout << bitmask1 << std::endl;

    BitMask bitmask2;
    bitmask2[3] = 1; // set
    bitmask2[3] = 0; // clear
    bitmask2[3];     // read
    std::cout << bitmask2 << std::endl;

    const BitMask bitmask3;
    if (bitmask3[3]) {}
    //bitmask3[3] = 1; // compile error - OK!
}
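One caveat worth noting about this proxy approach (my addition, in the same spirit as std::vector<bool>): auto deduces the proxy type rather than int, so a "copy" of a bit still tracks later changes to the mask:
BitMask m;
m[0] = 1;
auto b = m[0]; // b is BitMask::proxy_bit, not int
m[0] = 0;
int v = b;     // v == 0, not 1: the proxy re-reads the mask on conversion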
I'm trying to figure out how much the execution time of boost::variant differs from a polymorphism approach. In my first test I got very different results on gcc 4.9.1 and clang+llvm 3.5.
You can find the code below. Here are my results:
clang+llvm
polymorphism: 2.16401
boost::variant: 3.83487
gcc:
polymorphism: 2.46161
boost::variant: 1.33326
I compiled both with -O3.
Is someone able to explain that?
code
#include <iostream>
#include <vector>
#include <algorithm>
#include <memory> // for std::unique_ptr
#include <boost/variant.hpp>
#include <boost/variant/apply_visitor.hpp>
#include <ctime>

struct value_type {
    value_type() {}
    virtual ~value_type() {}
    virtual void inc() = 0;
};

struct int_type : value_type {
    int_type() : value_type() {}
    virtual ~int_type() {}
    void inc() { value += 1; }
private:
    int value = 0;
};

struct float_type : value_type {
    float_type() : value_type() {}
    virtual ~float_type() {}
    void inc() { value += 1; }
private:
    float value = 0;
};

void dyn_test() {
    std::vector<std::unique_ptr<value_type>> v;
    for (int i = 0; i < 1024; i++) {
        if (i % 2 == 0)
            v.emplace_back(new int_type());
        else
            v.emplace_back(new float_type());
    }
    for (int i = 0; i < 900000; i++) {
        std::for_each(v.begin(), v.end(), [](auto &item) { item->inc(); });
    }
}

struct visitor : boost::static_visitor<> {
    template <typename T> void operator()(T &item) { item += 1; }
};

using mytype = boost::variant<int, float>;

void static_test() {
    std::vector<mytype> v;
    for (int i = 0; i < 1024; i++) {
        if (i % 2 == 0)
            v.emplace_back(0);
        else
            v.emplace_back(0.f);
    }
    visitor vi;
    for (int i = 0; i < 900000; i++) {
        std::for_each(v.begin(), v.end(), boost::apply_visitor(vi));
    }
}

template <typename F> double measure(F f) {
    clock_t start = clock();
    f();
    clock_t end = clock();
    float seconds = (float)(end - start) / CLOCKS_PER_SEC;
    return seconds;
}

int main() {
    std::cout << "polymorphism: " << measure([] { dyn_test(); }) << std::endl;
    std::cout << "boost::variant: " << measure([] { static_test(); }) << std::endl;
    return 0;
}
assembler
gcc
clang+llvm
Clang is known to miscompile some std::vector functions from various standard libraries, due to some edge cases in its inliner. I don't know if those have been fixed by now, but quite possibly not. Since unique_ptr is smaller and simpler than boost::variant, it's more likely that it does not trigger those edge cases.
The code you posted is practically a showcase for why boost::variant is great. The polymorphic version adds a dynamic allocation per element and a random pointer dereference on top of the regular indirections that both approaches perform; that's a (relatively) heavy hit.
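To make the layout difference concrete, here is a small sketch of my own (not from the original benchmark): the variant vector keeps its values inline and contiguous, while the pointer-based vector stores only pointers contiguously, with each value in a separate heap allocation that iteration must chase.
#include <boost/variant.hpp>
#include <iostream>
#include <memory>
#include <vector>

int main()
{
    std::vector<boost::variant<int, float>> inline_values(4, 0); // values live in the vector's buffer
    std::vector<std::unique_ptr<int>> boxed_values;
    boxed_values.emplace_back(new int(0)); // value lives in its own allocation

    std::cout << sizeof(inline_values[0]) << '\n'; // the whole variant, stored in place
    std::cout << sizeof(boxed_values[0]) << '\n';  // just a pointer; the int is elsewhere
}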
I expect to get the numbers from 0 to 4 in random order, but instead I get some unsynchronized mess.
What am I doing wrong?
#include <iostream>
#include <windows.h>
#include <process.h>

using namespace std;

void addQuery(void *v);
HANDLE ghMutex;

int main()
{
    HANDLE hs[5];
    ghMutex = CreateMutex(NULL, FALSE, NULL);
    for (int i = 0; i < 5; ++i)
    {
        hs[i] = (HANDLE)_beginthread(addQuery, 0, (void *)&i);
        if (hs[i] == NULL)
        {
            printf("error\n"); return -1;
        }
    }
    printf("WaitForMultipleObjects return: %d error: %d\n",
        (DWORD)WaitForMultipleObjects(5, hs, TRUE, INFINITE), GetLastError());
    return 0;
}

void addQuery(void *v)
{
    int t = *((int*)v);
    WaitForSingleObject(ghMutex, INFINITE);
    cout << t << endl;
    ReleaseMutex(ghMutex);
    _endthread();
}
You have to read and write the shared variable inside the lock. You are reading it outside of the lock and thus rendering the lock irrelevant.
But even that's not enough since your shared variable is a loop variable that you are writing to without protection of the lock. A much better example would run like this:
#include <iostream>
#include <windows.h>
#include <process.h>

using namespace std;

void addQuery(void *v);
HANDLE ghMutex;
int counter = 0;

int main()
{
    HANDLE hs[5];
    ghMutex = CreateMutex(NULL, FALSE, NULL);
    for (int i = 0; i < 5; ++i)
    {
        hs[i] = (HANDLE)_beginthread(addQuery, 0, NULL);
        if (hs[i] == NULL)
        {
            printf("error\n"); return -1;
        }
    }
    printf("WaitForMultipleObjects return: %d error: %d\n",
        (DWORD)WaitForMultipleObjects(5, hs, TRUE, INFINITE), GetLastError());
    return 0;
}

void addQuery(void *v)
{
    WaitForSingleObject(ghMutex, INFINITE);
    cout << counter << endl;
    counter++;
    ReleaseMutex(ghMutex);
    _endthread();
}
If you can, use a critical section rather than a mutex because they are simpler to use and more efficient. But they have the same semantics in that they only protect code inside the locking block.
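A minimal sketch of that variant (Win32 API; only the parts that change relative to the code above are shown):
CRITICAL_SECTION gcs; // replaces ghMutex

void addQuery(void *v)
{
    EnterCriticalSection(&gcs);
    cout << counter << endl;
    counter++;
    LeaveCriticalSection(&gcs);
    _endthread();
}

// in main(), before starting the threads:
InitializeCriticalSection(&gcs);
// ... create the threads and WaitForMultipleObjects as before ...
DeleteCriticalSection(&gcs);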
Note: Jerry has pointed out some other problems, but I've concentrated on the high-level threading and serialization concerns.
Your synchronization has some issues, given that you want to get the numbers 0 to 4 in random order.
The problem is that the variable i is written outside the lock, and every time addQuery gets called by a running thread it may read an already-modified version of i. That's why you may see 5 as the output value for all threads.
Here is my fix for this scenario: instead of passing the address of the variable i as the parameter of addQuery, pass its value. Hope it helps:
#include <iostream>
#include <windows.h>
#include <process.h>

using namespace std;

void addQuery(void *v);
HANDLE ghMutex;

int main()
{
    HANDLE hs[5];
    ghMutex = CreateMutex(NULL, FALSE, NULL);
    for (int i = 0; i < 5; ++i)
    {
        // pass the value itself (smuggled through the pointer), not the address
        hs[i] = (HANDLE)_beginthread(addQuery, 0, (void *)(intptr_t)i);
        if (hs[i] == NULL)
        {
            printf("error\n"); return -1;
        }
    }
    printf("WaitForMultipleObjects return: %d error: %d\n",
        (DWORD)WaitForMultipleObjects(5, hs, TRUE, INFINITE), GetLastError());
    return 0;
}

void addQuery(void *v)
{
    int t = (int)(intptr_t)v; // recover the value
    WaitForSingleObject(ghMutex, INFINITE);
    cout << t << endl;
    ReleaseMutex(ghMutex);
    _endthread();
}