libtorrent - storage_interface readv explanation - libtorrent

I have implemented a custom storage interface in libtorrent as described in the help section here.
The storage_interface is working fine, although I can't figure out why readv is only called randomly while downloading a torrent. As I understand it, the overridden virtual function readv should get called each time I call handle->read_piece in piece_finished_alert. Shouldn't it read the piece for read_piece_alert?
The buffer is provided in read_piece_alert without getting notified in readv.
So the question is why it is called only randomly and why it's not called on a read_piece() call? Is my storage_interface maybe wrong?
The code looks like this:
struct temp_storage : storage_interface
{
    virtual int readv(file::iovec_t const* bufs, int num_bufs
        , int piece, int offset, int flags, storage_error& ec)
    {
        // Only called on random pieces while downloading a larger torrent
        std::map<int, std::vector<char> >::const_iterator i = m_file_data.find(piece);
        if (i == m_file_data.end()) return 0;
        int available = i->second.size() - offset;
        if (available <= 0) return 0;
        if (available > num_bufs) available = num_bufs;
        memcpy(&bufs, &i->second[offset], available);
        return available;
    }
    virtual int writev(file::iovec_t const* bufs, int num_bufs
        , int piece, int offset, int flags, storage_error& ec)
    {
        std::vector<char>& data = m_file_data[piece];
        if (data.size() < offset + num_bufs) data.resize(offset + num_bufs);
        std::memcpy(&data[offset], bufs, num_bufs);
        return num_bufs;
    }
    virtual bool has_any_file(storage_error& ec) { return false; }
    virtual ...
    virtual ...
};
Initialized with
storage_interface* temp_storage_constructor(storage_params const& params)
printf("NEW INTERFACE\n");
return new temp_storage(*params.files);
} = &temp_storage_constructor;
The function below sets up alerts and invokes read_piece on each completed piece.
while(true) {
std::vector<alert*> alerts;
for (alert* i : alerts)
switch (i->type()) {
case read_piece_alert::alert_type:
read_piece_alert* p = (read_piece_alert*)i;
if (p->ec) {
// read_piece failed
// piece buffer, size is provided without readv
// notification after invoking read_piece in piece_finished_alert
case piece_finished_alert::alert_type: {
piece_finished_alert* p = (piece_finished_alert*)i;
// Once the piece is finished, we read it to obtain the buffer in read_piece_alert.

I will answer my own question. As Arvid said in the comments, readv was not invoked because of caching. Setting settings_pack::use_read_cache to false causes readv to be invoked on every read.
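To picture why caching hides the calls, here is a toy model. It is not libtorrent code: ToyStorage, CachedReader and the plain bool use_read_cache are hypothetical stand-ins. It only shows a read cache sitting in front of a storage backend, so a piece already held in memory is returned without the backend's readv ever running.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical stand-in for a custom storage backend; counts how often it is read.
struct ToyStorage {
    std::map<int, std::vector<char>> pieces;
    int readv_calls = 0;

    void writev(int piece, std::vector<char> const& data) { pieces[piece] = data; }

    std::vector<char> readv(int piece) {
        ++readv_calls;
        auto it = pieces.find(piece);
        return it == pieces.end() ? std::vector<char>{} : it->second;
    }
};

// Hypothetical cache layer between the caller and the storage backend.
struct CachedReader {
    ToyStorage& storage;
    bool use_read_cache;
    std::map<int, std::vector<char>> cache;

    CachedReader(ToyStorage& s, bool use_cache)
        : storage(s), use_read_cache(use_cache) {}

    // A downloaded piece is written to storage and, if caching is on, kept in memory.
    void piece_downloaded(int piece, std::vector<char> const& data) {
        assert(!data.empty());
        storage.writev(piece, data);
        if (use_read_cache) cache[piece] = data;
    }

    // Serves from the cache when possible; only a miss reaches storage.readv().
    std::vector<char> read_piece(int piece) {
        if (use_read_cache) {
            auto it = cache.find(piece);
            if (it != cache.end()) return it->second;
        }
        return storage.readv(piece);
    }
};
```

With the cache enabled, reading a freshly downloaded piece leaves readv_calls at zero. With it disabled, every read_piece reaches the backend. In libtorrent 1.1, as far as I know, the real switch is applied with settings_pack::set_bool(settings_pack::use_read_cache, false) followed by session::apply_settings.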


What is the stack used for in CPython, if anything?

As far as I understand:
The OS kernel (e.g. Linux) always allocates a stack for each system-level thread when a thread is created.
CPython is known for using a private heap for its objects, including presumably the call stack for Python subroutines.
If so, what is the stack used for in CPython, if anything?
CPython is an ordinary C program. There is no magic in running a Python script / module / REPL / whatever: every piece of code must be read, parsed, and interpreted in a loop until it's done. There is a whole bunch of processor instructions behind every Python expression and statement.
Every "simple" top-level thing (parsing and production of bytecode, GIL management, attribute lookup, console I/O, etc.) is very complex under the hood. It consists of functions calling other functions calling other functions... which means there is a stack involved. Seriously, check it yourself: some of the source files span a few thousand lines of code.
Just reaching the main loop of the interpreter is an adventure of its own. Here is the gist, stitched together from pieces all around the code base:
int wmain(int argc, wchar_t **argv)
{
    return Py_Main(argc, argv);
}
// standard C entry point
int Py_Main(int argc, wchar_t **argv)
{
    _PyArgv args = /* ... */;
    return pymain_main(&args);
}
static int pymain_main(_PyArgv *args)
{
    // ... calling some initialization routines and checking for errors ...
    return Py_RunMain();
}
int Py_RunMain(void)
{
    int exitcode = 0;
    pymain_run_python(&exitcode);
    // ... clean-up ...
    return exitcode;
}
static void pymain_run_python(int *exitcode)
{
    // ... initializing interpreter state and startup config ...
    // ... determining main import path ...
    if (config->run_command) {
        *exitcode = pymain_run_command(config->run_command, &cf);
    }
    else if (config->run_module) {
        *exitcode = pymain_run_module(config->run_module, 1);
    }
    else if (main_importer_path != NULL) {
        *exitcode = pymain_run_module(L"__main__", 0);
    }
    else if (config->run_filename != NULL) {
        *exitcode = pymain_run_file(config, &cf);
    }
    else {
        *exitcode = pymain_run_stdin(config, &cf);
    }
    // ... clean-up
}
int PyRun_AnyFileExFlags(FILE *fp, const char *filename, int closeit, PyCompilerFlags *flags)
{
    // ... even more routing ...
    int err = PyRun_InteractiveLoopFlags(fp, filename, flags);
    // ...
}
int PyRun_InteractiveLoopFlags(FILE *fp, const char *filename_str, PyCompilerFlags *flags)
{
    // ... more initializing ...
    do {
        ret = PyRun_InteractiveOneObjectEx(fp, filename, flags);
        // ... error handling ...
    } while (ret != E_EOF);
    // ...
}
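The split between heap-allocated interpreter frames and the native stack can be sketched with a toy VM. This is an illustration only and not CPython's actual design: ToyVM, Frame and callee are hypothetical names. The frame records live in a heap-backed vector, yet executing them still takes native calls to eval, whose locals and return addresses sit on the C stack.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical toy VM: function f is a leaf when callee[f] == -1,
// otherwise it calls callee[f] and adds 1 to the result.
struct ToyVM {
    struct Frame { int func; };        // interpreter-level frame record

    std::vector<int> callee;
    std::vector<Frame> frames;         // the "Python" call stack, stored on the heap
    std::size_t max_heap_frames = 0;
    int native_depth = 0;              // eval() calls currently active on the native stack
    int max_native_depth = 0;

    explicit ToyVM(std::vector<int> c) : callee(std::move(c)) {}

    int eval(int func) {
        assert(func >= 0 && static_cast<std::size_t>(func) < callee.size());
        ++native_depth;                // this call now occupies a native stack frame
        if (native_depth > max_native_depth) max_native_depth = native_depth;
        frames.push_back(Frame{func});
        if (frames.size() > max_heap_frames) max_heap_frames = frames.size();

        int result = 1;
        if (callee[func] >= 0)
            result = eval(callee[func]) + 1;  // nested call recurses on the native stack

        frames.pop_back();
        --native_depth;
        return result;
    }
};
```

For a chain of four functions, both max_heap_frames and max_native_depth end up at four: the heap holds the interpreter's bookkeeping, while the native stack carries the C-level calls that do the work.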

Stored lambda function calls are very slow - fix or workaround?

In an attempt to make a more usable version of the code I wrote for an answer to another question, I used a lambda function to process an individual unit. This is a work in progress. I've got the "client" syntax looking pretty nice:
// for loop split into 4 threads, calling doThing for each index
parloop(4, 0, 100000000, [](int i) { doThing(i); });
However, I have an issue. Whenever I call the saved lambda, it takes up a ton of CPU time. doThing itself is an empty stub. If I just comment out the internal call to the lambda, then the speed returns to normal (4 times speedup for 4 threads). I'm using std::function to save the reference to the lambda.
My question is - Is there some better way that the stl library internally manages lambdas for large sets of data, that I haven't come across?
struct parloop
std::vector<std::thread> myThreads;
int numThreads, rangeStart, rangeEnd;
std::function<void (int)> lambda;
parloop(int _numThreads, int _rangeStart, int _rangeEnd, std::function<void(int)> _lambda) //
: numThreads(_numThreads), rangeStart(_rangeStart), rangeEnd(_rangeEnd), lambda(_lambda) //
void init()
for (int i = 0; i < numThreads; ++i)
myThreads[i] = std::thread(myThreadFunction, this, chunkStart(i), chunkEnd(i));
void exit()
for (int i = 0; i < numThreads; ++i)
int rangeJump()
return ceil(float(rangeEnd - rangeStart) / float(numThreads));
int chunkStart(int i)
return rangeJump() * i;
int chunkEnd(int i)
return std::min(rangeJump() * (i + 1) - 1, rangeEnd);
static void myThreadFunction(parloop *self, int start, int end) //
std::function<void(int)> lambda = self->lambda;
// we're just going to loop through the numbers and print them out
for (int i = start; i <= end; ++i)
lambda(i); // commenting this out speeds things up back to normal
void doThing(int i) // "payload" of the lambda function
int main()
auto start =;
auto stop =;
// run 4 trials of each number of threads
for (int x = 1; x <= 4; ++x)
// test between 1-8 threads
for (int numThreads = 1; numThreads <= 8; ++numThreads)
start =;
// this is the line of code which calls doThing in the loop
parloop(numThreads, 0, 100000000, [](int i) { doThing(i); });
stop =;
cout << numThreads << " Time = " << std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count() / 1000000.0f << " ms\n";
//cout << "\t\tsimple list, time was " << deltaTime2 / 1000000.0f << " ms\n";
return 0;
I'm using std::function to save the reference to the lambda.
That's one possible problem, as std::function is not a zero-runtime-cost abstraction. It is a type-erased wrapper that has a virtual-call like cost when invoking operator() and could also potentially heap-allocate (which could mean a cache-miss per call).
If you want to store your lambda in such a way that does not introduce additional overhead and that allows the compiler to inline it, you should use a template parameter. This is not always possible, but might fit your use case. Example:
template <typename TFunction>
struct parloop
std::thread **myThreads;
int numThreads, rangeStart, rangeEnd;
TFunction lambda;
parloop(TFunction&& _lambda,
int _numThreads, int _rangeStart, int _rangeEnd)
: lambda(std::move(_lambda)),
numThreads(_numThreads), rangeStart(_rangeStart),
// ...
To deduce the type of the lambda, you can use a helper function:
template <typename TF, typename... TArgs>
auto make_parloop(TF&& lambda, TArgs&&... xs)
{
    return parloop<std::decay_t<TF>>(
        std::forward<TF>(lambda), std::forward<TArgs>(xs)...);
}
auto p = make_parloop([](int i) { doThing(i); },
    numThreads, 0, 100000000);
I wrote an article that's related to the subject:
"Passing functions to functions"
It contains some benchmarks that show how much assembly is generated for std::function compared to a template parameter and other solutions.
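For a self-contained picture of the template-parameter approach, here is a minimal sketch. It is a compact free function rather than the parloop class above, and parallel_for is a hypothetical name. Because the callable's type is a template parameter, each worker calls the lambda directly with no std::function in between, which leaves the compiler free to inline it.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: runs body(i) for every i in [begin, end),
// split into contiguous chunks across num_threads threads.
template <typename Body>
void parallel_for(int num_threads, int begin, int end, Body body) {
    assert(num_threads > 0 && begin <= end);
    const int total = end - begin;
    const int chunk = (total + num_threads - 1) / num_threads;  // ceiling division

    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (int t = 0; t < num_threads; ++t) {
        const int lo = std::min(end, begin + t * chunk);
        const int hi = std::min(end, lo + chunk);
        if (lo == hi) continue;  // more threads than work items
        // Each worker gets its own copy of the callable and invokes it directly.
        workers.emplace_back([lo, hi, body] {
            for (int i = lo; i < hi; ++i) body(i);
        });
    }
    for (std::thread& w : workers) w.join();
}
```

A call then looks much like the original client syntax, for example parallel_for(4, 0, 100000000, [](int i) { doThing(i); });. Note that this sketch uses a half-open range and requires the callable to be invocable on a const copy, which ordinary non-mutable lambdas are.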

C++11: How to implement fast, lightweight, and fair synchronized resource access

What can I do to get a locking mechanism that provides minimal and stable latency while guaranteeing that a thread cannot reacquire a resource before another thread has acquired and released it?
The desirability of answers to this question is ranked as follows:
1. Some combination of built-in C++11 features that work in MinGW on Windows 7 (note that the <thread> and <mutex> libraries do not work on a Windows platform)
2. Some combination of Windows API features
3. A modification to the FairLock listed below, my own attempt at implementing such a mechanism
4. Some features provided by a free, open-source library that does not require a ./configure, make, make install process (getting that to work in MSYS is more of an adventure than I care for)
I am writing an application which is effectively a multi-stage producer/consumer. One thread generates input consumed by another thread, which produces output consumed by yet another thread. The application uses pairs of buffers so that, after an initial delay, all threads can work nearly simultaneously.
Since I am writing a Windows 7 application, I had been using CriticalSections to guard the buffers. The problem with using CriticalSections (or, so far as I can tell, any other Windows or C++11-built-in synchronization object) is that it does not allow for any provision that a thread that just released a lock cannot reacquire it until another thread has done so first. Because of this, many of my test drivers for the middle thread (the Encoder) never gave the Encoder a chance to acquire the test input buffers and completed without having tested them. The end result was a ridiculous process of trying to determine an artificial wait time that stochastically worked for my machine.
Since the structure of my application requires each stage to wait until the other stage has acquired, finished using, and released the necessary buffers before it gets to use them again, I need, for lack of a better term, a fair locking mechanism. I took a crack at writing one (the source code is provided below). In testing, this FairLock allows my test driver to run my Encoder at the same speed that I achieved with the CriticalSection in maybe 60% of the runs. The other 40% of the runs take anywhere from 10 to 100 ms longer, which is not acceptable for my application.
// FairLock.hpp
#include <atomic>
using namespace std;
class FairLock {
atomic_bool owned {false};
atomic<DWORD> lastOwner {0};
FairLock(bool owned);
bool inline hasLock() const;
bool tryLock();
void seizeLock();
void tryRelease();
void waitForLock();
// FairLock.cpp
#include <windows.h>
#include "FairLock.hpp"
#define ID GetCurrentThreadId()
FairLock::FairLock(bool owned) {
if (owned) {
this->owned = true;
this->lastOwner = ID;
} else {
this->owned = false;
this->lastOwner = 0;
bool inline FairLock::hasLock() const {
return owned && lastOwner == ID;
bool FairLock::tryLock() {
bool success = false;
DWORD id = ID;
if (owned) {
success = lastOwner == id;
} else if (
lastOwner != id &&
owned.compare_exchange_strong(success, true)
) {
lastOwner = id;
success = true;
} else {
success = false;
return success;
void FairLock::seizeLock() {
bool success = false;
DWORD id = ID;
if (!(owned && lastOwner == id)) {
while (!owned.compare_exchange_strong(success, true)) {
success = false;
lastOwner = id;
void FairLock::tryRelease() {
if (hasLock()) {
owned = false;
void FairLock::waitForLock() {
bool success = false;
DWORD id = ID;
if (!(owned && lastOwner == id)) {
while (lastOwner == id); // spin
while (!owned.compare_exchange_strong(success, true)) {
success = false;
lastOwner = id;
I reviewed the above code against The C++ Programming Language, 4th Edition (a text I had not read carefully before) and against the synchronous queue that CouchDeveloper recommended. I realized that there are several sequences in which the thread that just released the FairLock can be tricked into thinking it still owns it. All it takes is interleaving instructions as follows:
1. New owner: set owned to true
2. Old owner: is owned true? yes
3. Old owner: am I the last owner? yes
4. New owner: set me as the last owner
At this point, the old and new owners both enter their critical sections.
I am considering whether this problem has a solution and whether it is worth attempting to solve this at all. In the meantime, don't use this unless you see a fix.
I would implement this in C++11 using a condition_variable-per-thread setup so that I could choose exactly which thread to wake up when (Live demo at Coliru):
class FairMutex {
class waitnode {
std::condition_variable cv_;
waitnode* next_ = nullptr;
FairMutex& fmtx_;
waitnode(FairMutex& fmtx) : fmtx_(fmtx) {
*fmtx.tail_ = this;
fmtx.tail_ = &next_;
~waitnode() {
for (waitnode** p = &fmtx_.waiters_; *p; p = &(*p)->next_) {
if (*p == this) {
*p = next_;
if (!next_) {
fmtx_.tail_ = &fmtx_.waiters_;
void wait(std::unique_lock<std::mutex>& lk) {
while (fmtx_.held_ || fmtx_.waiters_ != this) {
void notify() {
waitnode* waiters_ = nullptr;
waitnode** tail_ = &waiters_;
std::mutex mtx_;
bool held_ = false;
void lock() {
auto lk = std::unique_lock<std::mutex>{mtx_};
if (held_ || waiters_) {
held_ = true;
bool try_lock() {
if (mtx_.try_lock()) {
std::lock_guard<std::mutex> lk(mtx_, std::adopt_lock);
if (!held_ && !waiters_) {
held_ = true;
return true;
return false;
void unlock() {
std::lock_guard<std::mutex> lk(mtx_);
held_ = false;
if (waiters_ != nullptr) {
FairMutex models the Lockable concept so it can be used like any other standard library mutex type. Put simply, it achieves fairness by inserting waiters into a list in arrival order, and passing the mutex to the first waiter in the list when unlocking.
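A simpler way to get the same arrival-order behavior is a ticket scheme. The sketch below is a different technique from the per-thread condition variable list above: every waiter shares one condition_variable, and TicketMutex and count_under_lock are hypothetical names. It assumes a toolchain where <mutex>, <condition_variable> and <thread> are available.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical ticket-based fair mutex: threads are served strictly in arrival order.
class TicketMutex {
    std::mutex mtx_;
    std::condition_variable cv_;
    unsigned long next_ticket_ = 0;  // next ticket to hand out
    unsigned long now_serving_ = 0;  // ticket currently allowed to hold the lock

public:
    void lock() {
        std::unique_lock<std::mutex> lk(mtx_);
        const unsigned long my_ticket = next_ticket_++;
        cv_.wait(lk, [&] { return now_serving_ == my_ticket; });
    }

    bool try_lock() {
        std::lock_guard<std::mutex> lk(mtx_);
        if (now_serving_ != next_ticket_) return false;  // held, or someone is queued
        ++next_ticket_;                                   // take the ticket being served
        return true;
    }

    void unlock() {
        {
            std::lock_guard<std::mutex> lk(mtx_);
            assert(now_serving_ != next_ticket_);  // the lock must currently be held
            ++now_serving_;
        }
        cv_.notify_all();  // every waiter rechecks; only the next ticket proceeds
    }
};

// Usage sketch: TicketMutex works with std::lock_guard like any other mutex type.
inline int count_under_lock(int num_threads, int iterations) {
    TicketMutex mtx;
    int counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&mtx, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<TicketMutex> guard(mtx);
                ++counter;
            }
        });
    }
    for (std::thread& th : threads) th.join();
    return counter;
}
```

A thread that unlocks and immediately calls lock() again draws a later ticket than anyone already queued, so it cannot jump ahead of them. If nobody is waiting at that moment it can still reacquire the lock. The trade-off against the per-waiter design is that notify_all wakes every waiter on each unlock.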
If it's useful:
This demonstrates *) an implementation of a "synchronous queue" using semaphores as synchronization primitives.
Note: the actual implementation uses semaphores implemented with GCD (Grand Central Dispatch):
using gcd::mutex;
using gcd::semaphore;
// A blocking queue in which each put must wait for a get, and vice
// versa. A synchronous queue does not have any internal capacity,
// not even a capacity of one.
template <typename T>
class simple_synchronous_queue {
typedef T value_type;
enum result_type {
OK = 0,
: sync_(0), send_(1), recv_(0)
void put(const T& v) {
new (address()) T(v);
result_type put(const T& v, double timeout) {
if (send_.wait(timeout)) {
new (storage_) T(v);
if (sync_.wait(timeout)) {
return OK;
else {
else {
T get() {
T result = *address();
return result;
std::pair<result_type, T> get(double timeout) {
if (recv_.wait(timeout)) {
std::pair<result_type, T> result =
std::pair<result_type, T>(OK, *address());
return result;
else {
return std::pair<result_type, T>(TIMEOUT_NOTHING_OFFERED, T());
using storage_t = typename std::aligned_storage<sizeof(T), std::alignment_of<T>::value>::type;
T* address() { 
return static_cast<T*>(static_cast<void*>(&storage_));
storage_t storage_;
semaphore sync_;
semaphore send_;
semaphore recv_;
*) demonstrates: be careful about potential issues, could be improved, etc. ... ;)
I accepted CouchDeveloper's answer since it pointed me down the right path. I wrote a Windows-specific C++11 implementation of a synchronous queue, and added this answer so that others could consider/use it if they so choose.
// SynchronousQueue.hpp
#include <atomic>
#include <exception>
#include <windows.h>
using namespace std;
class CouldNotEnterException: public exception {};
class NoPairedCallException: public exception {};
template <typename T>
class SynchronousQueue {
atomic_bool valueReady {false};
CRITICAL_SECTION getCriticalSection;
CRITICAL_SECTION putCriticalSection;
DWORD wait {0};
HANDLE getSemaphore;
HANDLE putSemaphore;
const T* address {nullptr};
SynchronousQueue(DWORD waitMS): wait {waitMS}, address {nullptr} {
getSemaphore = CreateSemaphore(nullptr, 0, 1, nullptr);
putSemaphore = CreateSemaphore(nullptr, 0, 1, nullptr);
~SynchronousQueue() {
void put(const T& value) {
if (!TryEnterCriticalSection(&putCriticalSection)) {
throw CouldNotEnterException();
ReleaseSemaphore(putSemaphore, (LONG) 1, nullptr);
if (WaitForSingleObject(getSemaphore, wait) != WAIT_OBJECT_0) {
if (WaitForSingleObject(putSemaphore, 0) == WAIT_OBJECT_0) {
throw NoPairedCallException();
} else {
WaitForSingleObject(getSemaphore, 0);
address = &value;
valueReady = true;
while (valueReady);
T get() {
if (!TryEnterCriticalSection(&getCriticalSection)) {
throw CouldNotEnterException();
ReleaseSemaphore(getSemaphore, (LONG) 1, nullptr);
if (WaitForSingleObject(putSemaphore, wait) != WAIT_OBJECT_0) {
if (WaitForSingleObject(getSemaphore, 0) == WAIT_OBJECT_0) {
throw NoPairedCallException();
} else {
WaitForSingleObject(putSemaphore, 0);
while (!valueReady);
T toReturn = *address;
valueReady = false;
return toReturn;
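For comparison, the same rendezvous idea can be sketched in portable C++11 with a mutex and a condition variable instead of Windows semaphores and spin loops. This is a minimal sketch, not a drop-in replacement: RendezvousQueue and relay are hypothetical names, there is no timeout, and it assumes a toolchain where <mutex>, <condition_variable> and <thread> work.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical synchronous queue: put() returns only after a get() has copied the value.
template <typename T>
class RendezvousQueue {
    std::mutex mtx_;
    std::condition_variable cv_;
    const T* offered_ = nullptr;   // value on offer, owned by the blocked put()
    bool producer_active_ = false; // true while one put() is in progress
    bool taken_ = false;           // set by get() once the value has been copied

public:
    // Blocks until a consumer has copied the value out.
    void put(const T& value) {
        std::unique_lock<std::mutex> lk(mtx_);
        cv_.wait(lk, [&] { return !producer_active_; });  // one offer at a time
        assert(offered_ == nullptr);
        producer_active_ = true;
        taken_ = false;
        offered_ = &value;
        cv_.notify_all();
        cv_.wait(lk, [&] { return taken_; });
        producer_active_ = false;
        cv_.notify_all();
    }

    // Blocks until a producer offers a value, then returns a copy of it.
    T get() {
        std::unique_lock<std::mutex> lk(mtx_);
        cv_.wait(lk, [&] { return offered_ != nullptr; });
        T result = *offered_;
        offered_ = nullptr;
        taken_ = true;
        cv_.notify_all();
        return result;
    }
};

// Usage sketch: a producer thread hands values to the calling thread one at a time.
inline std::vector<int> relay(std::vector<int> const& input) {
    RendezvousQueue<int> queue;
    std::thread producer([&] { for (int v : input) queue.put(v); });
    std::vector<int> received;
    for (std::size_t i = 0; i < input.size(); ++i) received.push_back(queue.get());
    producer.join();
    return received;
}
```

The value is copied while the mutex is held and while the producer is still blocked inside put(), so the stored pointer never outlives the referenced object. Blocking on the condition variable also avoids the busy-wait loops on valueReady.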

The semantics of v8::ResourceConstraints?

The v8::ResourceConstraints class is defined as follows:
class V8EXPORT ResourceConstraints {
int max_young_space_size() const { return max_young_space_size_; }
void set_max_young_space_size(int value) { max_young_space_size_ = value; }
int max_old_space_size() const { return max_old_space_size_; }
void set_max_old_space_size(int value) { max_old_space_size_ = value; }
int max_executable_size() { return max_executable_size_; }
void set_max_executable_size(int value) { max_executable_size_ = value; }
uint32_t* stack_limit() const { return stack_limit_; }
// Sets an address beyond which the VM's stack may not grow.
void set_stack_limit(uint32_t* value) { stack_limit_ = value; }
int max_young_space_size_;
int max_old_space_size_;
int max_executable_size_;
uint32_t* stack_limit_;
Can someone tell me what young_space_size, old_space_size, and max_executable_size are? What are their units, how are they related, etc.? There doesn't seem to be much documentation.
Also, how does one use the stack_limit property? For example, if I want my V8 isolate to use no more than 1MB of stack space, how would I calculate a pointer value for stack_limit?
v8/test/cctest/ uses this function to calculate the limit:
// Uses the address of a local variable to determine the stack top now.
// Given a size, returns an address that is that far from the current
// top of stack.
static uint32_t* ComputeStackLimit(uint32_t size) {
    uint32_t* answer = &size - (size / sizeof(size));
    // If the size is very large and the stack is very near the bottom of
    // memory then the calculation above may wrap around and give an address
    // that is above the (downwards-growing) stack. In that case we return
    // a very low address.
    if (answer > &size) return reinterpret_cast<uint32_t*>(sizeof(size));
    return answer;
}
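The same calculation can be written with plain byte arithmetic, which may make the 1 MB case easier to follow. This is a sketch under the same assumption of a downward-growing stack. stack_limit_below and current_stack_limit are hypothetical helper names and are not part of V8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns the address budget_bytes below stack_top, assuming the stack grows downwards.
// Falls back to a very low address if the subtraction would wrap around.
inline std::uintptr_t stack_limit_below(std::uintptr_t stack_top, std::size_t budget_bytes) {
    assert(stack_top != 0);
    if (budget_bytes >= stack_top) return sizeof(void*);
    return stack_top - budget_bytes;
}

// Uses the address of a local variable as an approximation of the current stack top.
inline std::uintptr_t current_stack_limit(std::size_t budget_bytes) {
    int marker = 0;
    return stack_limit_below(reinterpret_cast<std::uintptr_t>(&marker), budget_bytes);
}
```

For a 1 MB budget, current_stack_limit(1024 * 1024) yields an address one megabyte below the caller's current stack position. Given the signature shown above, that value would then be cast to uint32_t* before being passed to set_stack_limit.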

Sharing an object between threads

How would you set object data that is shared between threads and needs to be updated once after each complete cycle of (say) two threads running in a busy loop?
CRITICAL_SECTION critical_section_;
int value; //needs to be updated once after the cycle of any number of threads running in busy loop
void ThreadsFunction(int i)
while (true)
/* Lines of Code */
Edit: The value can be an object of any class.
Two suggestions:
Make the object itself thread safe.
Pass the object into the thread as instance data
I'll use C++ as a reference in my example. You can easily transpose this to pure C if you want.
// MyObject is the core data you want to share between threads
struct MyObject
int value;
int othervalue;
// add all the other members you want here
class MyThreadSafeObject
MyObject _myobject;
bool _fLocked;
_fLocked = false;
// add "getter and setter" methods for each member in MyObject
int SetValue(int x)
_myobject.value = x;
int GetValue()
int x;
x = _myobject.value;
return x;
// add "getter and setter" methods for each member in MyObject
int SetOtherValue(int x)
_myobject.othervalue = x;
int GetOtherValue()
int x;
x = _myobject.othervalue;
return x;
// and if you need to access the whole object directly without using a critsec lock on each variable access, add lock/unlock methods
bool Lock(MyObject** ppObject)
*ppObject = &_myobject;
_fLocked = true;
return true;
bool UnLock()
if (_fLocked == false)
return false;
_fLocked = false;
return true;
Then, create your object and thread as follows:
MyThreadSafeObject* pObjectThreadSafe;
MyObject* pObject = NULL;
// now initialize your object
pObject->value = 0; // initialize value and all the other members of pObject to what you want them to be.
pObject->othervalue = 0;
pObject = NULL;
// Create your threads, passing the pointer to MyThreadSafeObject as your instance data
DWORD dwThreadID = 0;
HANDLE hThread = CreateThread(NULL, 0, ThreadFunction, pObjectThreadSafe, 0, &dwThreadID);
And your thread will operate as follows
DWORD __stdcall ThreadFunction(void* pData)
MyThreadSafeObject* pObjectThreadSafe = (MyThreadSafeObject*)pData;
MyObject* pObject = NULL;
while (true)
/* lines of code */
/* lines of code */
If you want to implement a thread-safe update of an integer, you are better off using the InterlockedIncrement, InterlockedDecrement, or InterlockedExchangeAdd functions. See
If you do need to use EnterCriticalSection and LeaveCriticalSection, you will find an example in, but I recommend calling EnterCriticalSection inside a __try block and LeaveCriticalSection inside the __finally part of that block.
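As a portable illustration of the lock-free integer update, here is a sketch that uses std::atomic in place of the Win32 Interlocked functions. That substitution is mine, made so the example does not depend on windows.h. count_atomically is a hypothetical helper name.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each thread adds to a shared counter without taking a lock.
// fetch_add is an atomic read-modify-write, so no increment is lost.
inline int count_atomically(int num_threads, int increments_per_thread) {
    assert(num_threads >= 0 && increments_per_thread >= 0);
    std::atomic<int> value{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&value, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (std::thread& th : threads) th.join();  // join makes all increments visible here
    return value.load();
}
```

This only covers a single integer. For an object of any class, as in the edit to the question, a lock around the whole update is still needed.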
