I'm writing a C library in which I want to optionally support concurrency via OpenMP (so that one may compile it serially if the compiler does not support OpenMP). I'd like to use a lock-free stack implementation.
I thought about using C's stdatomic.h for the stack, but it seems that until a few weeks ago GCC couldn't use _Atomic with OpenMP, so this would complicate portability. Clang 3.8 seems to handle atomics with OpenMP correctly, but this still would not be the best choice, since there's no need to keep the data atomic when compiling without OpenMP (and thus serially).
I seem to need a compare-and-exchange operation when popping from the stack, and I couldn't find any information about compare-and-exchange in OpenMP. Is there any way to implement a lock-free stack solely with OpenMP?
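For reference, this is the kind of conditional compilation I have in mind for the serial build (a sketch only, untested; LFSTACK_ATOMIC is just an illustrative macro name):

#ifdef _OPENMP
  /* OpenMP build: use C11 atomics */
  #include <stdatomic.h>
  #define LFSTACK_ATOMIC(T) _Atomic(T)
#else
  /* serial build: plain types, no atomic overhead */
  #define LFSTACK_ATOMIC(T) T
#endif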
My code so far (works with clang):
struct lfstack_node {
    void *value;
    struct lfstack_node *next;
};

typedef struct lfstack {
    _Atomic(size_t) size;
    _Atomic(struct lfstack_node *) head;
    _Atomic int aba;
} *lfstack_t;
// ...
void *lfstack_pop(lfstack_t stack) {
    if (stack) {
        atomic_fetch_add(&stack->aba, 1);
        struct lfstack_node *node, *next;
        do {
            node = atomic_load(&stack->head);
            if (!node) {
                break;
            }
            // ABA problem here if not handled correctly
            next = node->next;
        } while (!atomic_compare_exchange_weak(&stack->head, &node, next));
        atomic_fetch_sub(&stack->aba, 1);
        if (node) {
            int zero = 0;
            while (!atomic_compare_exchange_weak(&stack->aba, &zero, zero)) {
                continue;
            }
            void *value = node->value;
            free(node);
            return value;
        }
    }
    return NULL;
}
Say I'm making a general-purpose collection of some sort, and there are 4-5 points where a user might want to choose implementation A or B. For instance:
homogeneous or heterogeneous
do we maintain a count of the contained objects, which is slower
do we have it be thread-safe or not
I could just make 16 or 32 implementations, with each combination of features, but obviously this won't be easy to write or maintain.
I could pass boolean flags to the constructor, which the class could check before doing certain operations. However, the compiler doesn't "know" what those arguments were, so it has to check them every time, and just checking enough boolean flags itself imposes a performance penalty.
So I'm wondering if template arguments can somehow be used so that at compile time the compiler sees if (false) or if (true) and therefore can completely optimize out the condition test, and if false, the conditional code. I've only found examples of templates as types, however, not as compile-time constants.
The main goal would be to utterly eliminate those calls to lock mutexes, increment and decrement counters, and so on; additionally, if there's some way to actually remove the mutex or counters from the object structure as well, that would be truly optimal.
Conditional computation before C++17 was mostly about template specialization: either specializing the function itself,
template<class T> void f(T &);  // primary template (declaration)

template<> void f<int>(int &) {
    std::cout << "Locking an int...\n";
    std::cout << "Unlocking an int...\n";
}

template<> void f<std::mutex>(std::mutex &m) {
    m.lock();
    m.unlock();
}
But this actually creates rather branchy code (in your case, I suspect), so a sounder alternative is to extract all the dependent, type-specific parts into a static interface and define a static implementation of it for each concrete type:
template<class T> struct lock_traits; // interface

template<> struct lock_traits<int> {
    static void lock(int &) { std::cout << "Locking an int...\n"; }
    static void unlock(int &) { std::cout << "Unlocking an int...\n"; }
};

template<> struct lock_traits<std::mutex> {
    static void lock(std::mutex &m) { m.lock(); }
    static void unlock(std::mutex &m) { m.unlock(); }
};

template<class T> void f(T &t) {
    lock_traits<T>::lock(t);
    lock_traits<T>::unlock(t);
}
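For completeness, a minimal usage sketch of the traits version (assumes <iostream> and <mutex> are included):

int main() {
    std::mutex m;
    int i = 0;
    f(m);  // dispatches to lock_traits<std::mutex>
    f(i);  // dispatches to lock_traits<int>
}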
In C++17, if constexpr was finally introduced; now not all branches have to compile in all circumstances.
template<class T> void f(T &t) {
    if constexpr (std::is_same_v<T, std::mutex>) {
        t.lock();
    }
    else if constexpr (std::is_same_v<T, int>) {
        std::cout << "Locking an int...\n";
    }

    if constexpr (std::is_same_v<T, std::mutex>) {
        t.unlock();
    }
    // forgot to unlock an int here :(
}
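To connect this back to the original question: a non-type template parameter gives you exactly the compile-time booleans you asked about, and a conditional base class can remove the mutex from the object layout entirely. A minimal sketch, assuming C++17 (Collection, ThreadSafe, and Empty are names I made up for illustration):

#include <mutex>
#include <type_traits>
#include <vector>

struct Empty {};  // placeholder base when no mutex is wanted

template <bool ThreadSafe>
class Collection : private std::conditional_t<ThreadSafe, std::mutex, Empty> {
public:
    void insert(int value) {
        if constexpr (ThreadSafe) {
            // *this converts to the private std::mutex base inside the class
            std::lock_guard<std::mutex> guard(*this);
            data_.push_back(value);
        } else {
            data_.push_back(value);  // no lock, and no mutex in the layout
        }
    }
private:
    std::vector<int> data_;
};

Thanks to the empty base optimization, Collection<false> contains no mutex at all, so the dead branch and the member storage both disappear at compile time.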
I would like to know if it is possible to have a for loop at compile time, with a runtime or even a compile-time limit condition, in C++11.
I'll start with a naive attempt to show what I need:
for (uint32_t i = 0; i < n; ++i)
{
    templated_func<i>();  // won't compile: i is not a constant expression
}
Consider that I have a class with a private member variable n, and I want to call a template function with numbers iterating from 0 to n (this is the runtime-limit-condition case).
I've studied "Template Metaprogramming" and "Constexpr If" (C++17), but I haven't gotten any results. Can anyone help me?
You can't have a for loop, but you can call N instantiations of templated_func:
namespace detail {
    template <template<uint32_t> class F, uint32_t... Is>
    void static_for_impl(std::integer_sequence<uint32_t, Is...>)
    {
        (F<Is>{}(), ...);  // C++17 fold expression
    }
}

template <template<uint32_t> class F, uint32_t N>
void static_for()
{
    detail::static_for_impl<F>(std::make_integer_sequence<uint32_t, N>{});
}
template <uint32_t I>
struct templated_caller
{
    void operator()() { templated_func<I>(); }
};
int main()
{
    static_for<templated_caller, 10>();
    return 0;
}
Note that this is more general than what you asked for. You can simplify it to just
template <uint32_t... Is>
void call_templated_func(std::integer_sequence<uint32_t, Is...>)
{
    (templated_func<Is>(), ...);  // C++17 fold expression
}
int main()
{
    call_templated_func(std::make_integer_sequence<uint32_t, N>{});  // N: a compile-time constant
    return 0;
}
but that's lengthy to repeat multiple times, and you can't pass a function template as a template parameter.
As you said you only have C++11, you will not have std::make_index_sequence and will have to provide it yourself. Also, the fold expression in Caleth's answer is not available until C++17.
Providing your own implementation of index_sequence and a fold-expression substitute in C++11 can be done in the following way:
#include <cstddef>
#include <iostream>

template <size_t... Is>
struct index_sequence {};

namespace detail {
    template <size_t I, size_t... Is>
    struct make_index_sequence_impl : make_index_sequence_impl<I - 1, I - 1, Is...> {};

    template <size_t... Is>
    struct make_index_sequence_impl<0, Is...>
    {
        using type = index_sequence<Is...>;
    };
}

template <size_t N>
using make_index_sequence = typename detail::make_index_sequence_impl<N>::type;

template <size_t I>
void templated_func()
{
    std::cout << "templated_func" << I << std::endl;
}

template <size_t... Is>
void call_templated_func(index_sequence<Is...>)
{
    // expander trick: evaluates templated_func<Is>() left-to-right in C++11
    using do_ = int[];
    do_{0, (templated_func<Is>(), 0)..., 0};
}

int main()
{
    call_templated_func(make_index_sequence<10>());
    return 0;
}
This is essentially the same as the answer by @Caleth, but with the missing bits provided, and it will compile as C++11.
I would like to know if it is possible to have a for loop at compile time, with a runtime or even a compile-time limit condition, in C++11.
I don't know a reasonable way to have such a loop with a runtime condition.
With a compile-time condition... if you can use at least C++14, you can use a solution based on std::integer_sequence/std::make_integer_sequence (see Caleth's answer) or maybe std::index_sequence/std::make_index_sequence (just a little more concise).
If you're limited to C++11, you can create a surrogate for std::index_sequence/std::make_index_sequence, or you can create a recursive template struct with a static function (unfortunately you can't partially specialize a template function, but you can partially specialize classes and structs).
I mean... something as follows
template <std::size_t I, std::size_t Top>
struct for_loop
{
    static void func ()
    {
        templated_func<I>();
        for_loop<I + 1u, Top>::func();
    }
};

template <std::size_t I>
struct for_loop<I, I>
{ static void func () { } };
that you can call
constexpr auto n = 10u;
for_loop<0, n>::func();
if you want to call templated_func() with values from zero to n-1u.
Unfortunately this solution is recursive, so you can run into compiler recursion limits. That is, it works only if n isn't high.
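If that becomes a problem, a common workaround (my addition, not part of the original answer) is to split the range in two halves, so the instantiation depth is O(log n) instead of O(n):

template <std::size_t Lo, std::size_t Len>
struct for_loop_bisect
{
    static void func ()
    {
        // run the first half of the range, then the second half
        for_loop_bisect<Lo, Len / 2>::func();
        for_loop_bisect<Lo + Len / 2, Len - Len / 2>::func();
    }
};

template <std::size_t Lo>
struct for_loop_bisect<Lo, 0>
{ static void func () { } };

template <std::size_t Lo>
struct for_loop_bisect<Lo, 1>
{ static void func () { templated_func<Lo>(); } };

Then for_loop_bisect<0, n>::func() calls templated_func<0>() through templated_func<n-1>() in order.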
I have a use case where one thread reads messages into a large buffer and then distributes the processing to a bunch of threads. The buffer is shared by multiple threads after that. It's read-only, and when the last thread finishes, the buffer has to be freed. The buffer is allocated from a lock-free slab allocator.
My initial design was to use shared_ptr for the buffer. But the buffer can be of different sizes. My way of getting around that was to do something like this:
struct SharedBuffer {
    SharedBuffer (uint16_t len, std::shared_ptr<void> ptr)
        : _length(len), _buf(std::move(ptr))
    {
    }

    uint8_t *data () { return (uint8_t *)_buf.get(); }

    uint16_t _length;
    std::shared_ptr<void> _buf; // type-erase the shared_ptr, as the SharedBuffer
                                // needs to be stored in some other structs
};
Now the allocator will allocate the shared_ptr like this:
SharedBuffer allocate (size_t size)
{
    auto buf = std::allocate_shared<std::array<uint8_t, 16_K>>(myallocator);
    return SharedBuffer{16_K, buf}; // type-erase the std::array
}
And the SharedBuffer is enqueued to each thread that wants it.
Now I think I am doing a lot of stuff unnecessarily, and I can sort of make do with boost::intrusive_ptr using the scheme below. Things are a bit C'ish, as I am using a variable-size array. Here I have replaced the slab allocator with operator new() for the sake of simplicity. I wanted to run it by you to see if this implementation is okay.
template <typename T>
inline int atomicIncrement (T* t)
{
    return __atomic_add_fetch(&t->_ref, 1, __ATOMIC_ACQUIRE);
}

template <typename T>
inline int atomicDecrement (T* t)
{
    return __atomic_sub_fetch(&t->_ref, 1, __ATOMIC_RELEASE);
}
class SharedBuffer {
public:
    friend int atomicIncrement<SharedBuffer>(SharedBuffer*);
    friend int atomicDecrement<SharedBuffer>(SharedBuffer*);

    SharedBuffer(uint16_t len) : _length(len) {}

    uint8_t *data ()
    {
        return &_data[0];
    }

    uint16_t length () const
    {
        return _length;
    }

private:
    int _ref{0};
    const uint16_t _length;
    uint8_t _data[]; // flexible array member (a GCC/Clang extension in C++)
};
using SharedBufferPtr = boost::intrusive_ptr<SharedBuffer>;
SharedBufferPtr allocate (size_t size)
{
    // dummy implementation
    void *p = ::operator new (size + sizeof(SharedBuffer));
    // I am not explicitly constructing the array of uint8_t
    return new (p) SharedBuffer(size);
}
void deallocate (SharedBuffer* sbuf)
{
    sbuf->~SharedBuffer();
    // dummy implementation
    ::operator delete ((void *)sbuf);
}
void intrusive_ptr_add_ref(SharedBuffer* sbuf)
{
    atomicIncrement(sbuf);
}

void intrusive_ptr_release (SharedBuffer* sbuf)
{
    if (atomicDecrement(sbuf) == 0) {
        deallocate(sbuf);
    }
}
I'd use the simpler implementation (using shared_ptr) unless you are avoiding specific problems (i.e. profile first).
Side note: you can use boost::shared_ptr<> with boost::make_shared<T[]>(N), which is being added to the standard library in C++20.
Note that allocate_shared already embeds the control block in the same allocation, just as you do with the intrusive approach.
Finally, I'd use std::atomic_int so you have a clear contract that cannot (accidentally) be used incorrectly. At the same time, it'll remove the remaining bit of complexity.
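For illustration, a minimal sketch of that last suggestion (names follow the question's code; the memory orderings mirror the original acquire/release pair):

#include <atomic>
#include <cstdint>

class SharedBuffer;
void deallocate(SharedBuffer *);  // as defined in the question

class SharedBuffer {
public:
    explicit SharedBuffer(uint16_t len) : _length(len) {}
    // data() and length() as in the question...
private:
    friend void intrusive_ptr_add_ref(SharedBuffer *);
    friend void intrusive_ptr_release(SharedBuffer *);

    std::atomic<int> _ref{0};
    const uint16_t _length;
};

void intrusive_ptr_add_ref(SharedBuffer *sbuf)
{
    sbuf->_ref.fetch_add(1, std::memory_order_acquire);
}

void intrusive_ptr_release(SharedBuffer *sbuf)
{
    // fetch_sub returns the previous value, so 1 means we were the last owner
    if (sbuf->_ref.fetch_sub(1, std::memory_order_release) == 1) {
        deallocate(sbuf);
    }
}

This drops the templated atomicIncrement/atomicDecrement helpers and their friend declarations entirely.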
I'm implementing flow control in a custom protocol in the Linux kernel. When I receive an ACK, I want to remove the acked packets from the write queue. Here's some code:
for (i = (ack->sequence - qp->first_unack); i > 0 && sk->sk_write_queue.qlen > 0; i++) {
    skb_del = skb_dequeue(&sk->sk_write_queue);
    qp->first_unack++;
    kfree_skb(skb_del);
}
I get a kernel freeze from this code. However, everything works well when I comment out the kfree_skb(skb_del). Any ideas why this is happening? How else can I free up the memory?
As the skbs are queued to the socket, you can use the already-provided socket APIs:
sk_eat_skb(struct sock *sk, struct sk_buff *skb, bool copied_early) // copied_early = 0
For more details you can trace tcp_recvmsg; there you will probably find the implementation flow.
Moreover, why are you doing the queuing/dequeuing loop on your own with custom APIs? Just go through include/net/sock.h; I hope you will find the necessary details.
This is probably because of double freeing skb_del.
Theoretically, before calling kfree_skb(skb_del) you could check the value of skb_del->users with refcount_read(&skb_del->users); if skb_del->users is 0, then skb_del has already been freed.
In practice, the kfree_skb() function doesn't set skb_del->users to 0 when skb_del is finally released (due to some optimization considerations), so after skb_del is released it stays 1, and you won't be able to tell whether skb_del has been released or not.
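For illustration only, the theoretical check would look roughly like this (and, as just explained, it is not reliable):

/* unreliable: a released skb's users count may still read 1 */
if (refcount_read(&skb_del->users) == 0)
    printk(KERN_WARNING "skb_del was already freed\n");
else
    kfree_skb(skb_del);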
If you are still curious whether this is a double-free issue, and you are fine with making some changes to the skbuff infrastructure (just for this investigation), then we need to modify some skbuff functions.
WARNING: It's very easy to crash the kernel when playing with these functions, so be careful. But these modifications work (this is how I've found a double-free of an skb before). Keep in mind that this is a suggestion only for investigating the double-free issue, and I have no idea how these modifications will affect your system in the long run.
We'll modify the following functions (based on kernel v5.9.1):
skb_unref() // from include/linux/skbuff.h
__kfree_skb() // from net/core/skbuff.c
kfree_skb() // from net/core/skbuff.c
consume_skb() // from net/core/skbuff.c
Original skb_unref()
static inline bool skb_unref(struct sk_buff *skb)
{
    if (unlikely(!skb))
        return false;
    if (likely(refcount_read(&skb->users) == 1))
        smp_rmb();
    else if (likely(!refcount_dec_and_test(&skb->users)))
        return false;

    return true;
}
Modified skb_unref()
static inline bool skb_unref(struct sk_buff *skb)
{
    if (unlikely(!skb))
        return false;
    if (likely(refcount_read(&skb->users) == 1)) {
        smp_rmb();
        refcount_set(&skb->users, 0);
    } else if (likely(!refcount_dec_and_test(&skb->users))) {
        return false;
    }

    return true;
}
Original __kfree_skb()
void __kfree_skb(struct sk_buff *skb)
{
    skb_release_all(skb);
    kfree_skbmem(skb);
}
Modified __kfree_skb()
void __kfree_skb(struct sk_buff *skb)
{
    if (!skb_unref(skb))
        return;

    skb_release_all(skb);
    kfree_skbmem(skb);
}
Original kfree_skb()
void kfree_skb(struct sk_buff *skb)
{
    if (!skb_unref(skb))
        return;

    trace_kfree_skb(skb, __builtin_return_address(0));
    __kfree_skb(skb);
}
Modified kfree_skb()
void kfree_skb(struct sk_buff *skb)
{
    //if (!skb_unref(skb))
    //    return;

    trace_kfree_skb(skb, __builtin_return_address(0));
    __kfree_skb(skb);
}
Original consume_skb()
void consume_skb(struct sk_buff *skb)
{
    if (!skb_unref(skb))
        return;

    trace_consume_skb(skb);
    __kfree_skb(skb);
}
Modified consume_skb()
void consume_skb(struct sk_buff *skb)
{
    //if (!skb_unref(skb))
    //    return;

    trace_consume_skb(skb);
    __kfree_skb(skb);
}
Good luck with the investigation.
May God be with you.
I'm totally new to C++ GUIs.
I'm trying to make a simple Windows Forms app to draw my dining philosophers semaphore solution.
My semaphore header file:
ref class sema4
{
private:
    int sem_value;
    queue Waiting_List;
public:
    sema4();
    void wait(HANDLE h);
    void signal();
};
My semaphore .cpp:
sema4::sema4()
{
    sem_value = 1;
}

// suspend the thread
void sema4::wait(HANDLE h)
{
    sem_value = sem_value - 1;
    if (sem_value < 0)
    {
        Waiting_List.enqueue(h);
        SuspendThread(h);
    }
}

// resume the thread
void sema4::signal()
{
    sem_value = sem_value + 1;
    if (sem_value <= 0)
    {
        ResumeThread(Waiting_List.dequeue());
    }
}
My queue header file:
ref class queue
{
private:
    HANDLE list[20];
    int front;
    int rear;
public:
    queue();
    void enqueue(HANDLE x);
    HANDLE dequeue();
    bool isempty();
    bool isfull();
};
The queue .cpp:
queue::queue()
{
    front = -1;
    rear = -1;
}

void queue::enqueue(HANDLE x)
{
    if (isfull())
    {
        cout << "queue is full";
    }
    else
    {
        if (front == -1)
            front = 0;
        rear = (rear + 1) % 20;
        list[rear] = x;
    }
}

bool queue::isfull()
{
    if (front == (rear + 1) % 20)
        return true;
    return false;
}

HANDLE queue::dequeue()
{
    if (isempty())
    {
        cout << "queue is empty";
        return NULL;
    }
    else
    {
        HANDLE x = list[front];
        if (front == rear)
            front = rear = -1;
        else
            front = (front + 1) % 20;
        return x;
    }
}

bool queue::isempty()
{
    if ((front == rear) && (rear == -1))
    {
        return true;
    }
    return false;
}
I keep getting the error C4368: cannot define 'list' as a member of managed 'queue': mixed types are not supported, and I have no real experience using C++ Windows Forms.
The simple answer
The compile error you're getting is because queue is a managed type. Managed types need to be declared with a ^, and created using gcnew.
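For illustration, a minimal sketch of what that looks like for the members involved (my wording, not a complete fix):

ref class queue
{
private:
    cli::array<HANDLE>^ list;   // managed array handle instead of HANDLE list[20]
    int front;
    int rear;
public:
    queue()
    {
        list = gcnew cli::array<HANDLE>(20);  // created with gcnew
        front = -1;
        rear = -1;
    }
    // ...
};

// and in sema4:
// queue^ Waiting_List;             // handle to the managed queue
// Waiting_List = gcnew queue();    // in the constructor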
The more complex answer
What you're writing isn't C++ code. This is a language called C++/CLI, which is intended for interop between .Net managed languages such as C# and unmanaged languages such as C and C++. As such, it has all of the complexities of C++, all of the complexities of C#, and a few extra of its own.
While you're just learning, please pick one or the other, and go with that. If you want to write managed code, learn C#. If you want to write unmanaged code, learn C++. Don't tackle C++/CLI while you're still learning.