IO Completion ports: separate thread pool to process the dequeued packets? - c++11

NOTE: I have added the C++ tag to this because a) the code is C++ and b) people using C++ may well have used IO completion ports. So please don't shout.
I am playing with IO completion ports, and have finally understood (and tested, to prove it) - with help from RbMm - the meaning of the NumberOfConcurrentThreads parameter of CreateIoCompletionPort().
I have the following small program which creates 10 threads all waiting on the completion port. I tell my completion port to only allow 4 threads to be runnable at once (I have four CPUs). I then enqueue 8 packets to the port. My thread function outputs a message if it dequeues a packet with an ID > 4; in order for this message to be output, I have to stop at least one of the four currently running threads, which happens when I enter '1' at the console.
Now this is all fairly simple code. I have one big concern, however: if all of the threads that are processing a completion packet get bogged down, no more packets can be dequeued and processed. That is what I am simulating with my infinite loop - the fact that no more packets are dequeued until I enter '1' at the console highlights this potential problem!
Would a better solution not be to have my four threads dequeue packets (or as many threads as there are CPUs), then, when one is dequeued, farm the processing of that packet off to a worker thread from a separate pool, thereby removing the risk of all threads in the IOCP being bogged down so that no more packets get dequeued?
I ask because all the examples of IO completion port code I have seen use a method similar to what I show below, rather than the separate thread pool I propose. This is what makes me think that I am missing something, because I am outnumbered!
Note: this is a somewhat contrived example, because Windows will allow an additional packet to be dequeued if one of the runnable threads enters a wait state; I show this in my code with a commented-out cout call:
The system also allows a thread waiting in GetQueuedCompletionStatus
to process a completion packet if another running thread associated
with the same I/O completion port enters a wait state for other
reasons, for example the SuspendThread function. When the thread in
the wait state begins running again, there may be a brief period when
the number of active threads exceeds the concurrency value. However,
the system quickly reduces this number by not allowing any new active
threads until the number of active threads falls below the concurrency
value.
But I won't be calling SuspendThread in my thread functions, and I don't know which functions other than cout will cause a thread to enter a wait state, so I can't predict whether one or more of my threads will ever get bogged down! Hence my idea of a thread pool: at least context switching would give other packets a chance to be dequeued!
#define _CRT_SECURE_NO_WARNINGS
#include <windows.h>
#include <thread>
#include <vector>
#include <algorithm>
#include <atomic>
#include <functional>
#include <iostream>
using namespace std;

int main()
{
    HANDLE hCompletionPort1;
    if ((hCompletionPort1 = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 4)) == NULL)
    {
        return -1;
    }
    vector<thread> vecAllThreads;
    atomic_bool bStop(false);

    // Fill our vector with 10 threads, each of which waits on our IOCP.
    generate_n(back_inserter(vecAllThreads), 10, [hCompletionPort1, &bStop]
    {
        return thread([hCompletionPort1, &bStop]()
        {
            // Thread body
            while (true)
            {
                DWORD dwBytes = 0;
                LPOVERLAPPED pOverlapped = nullptr;
                ULONG_PTR uKey;
                if (::GetQueuedCompletionStatus(hCompletionPort1, &dwBytes, &uKey, &pOverlapped, INFINITE))
                {
                    if (dwBytes == 0 && uKey == 0 && pOverlapped == nullptr)
                        break; // Special completion packet; end processing.
                    delete pOverlapped; // Free the OVERLAPPED allocated in main (leaked in the original code).
                    //cout << uKey; // EVEN THIS WILL CAUSE A "wait" which causes MORE THAN 4 THREADS TO ENTER!
                    if (uKey > 4)
                        cout << "Started processing packet ID > 4!" << endl;
                    while (!bStop)
                        ; // INFINITE LOOP simulating processing that never enters a wait state
                }
            }
        });
    });

    // Queue 8 completion packets to our IOCP... only four will be processed until we set our bool.
    for (int i = 1; i <= 8; ++i)
    {
        PostQueuedCompletionStatus(hCompletionPort1, 0, i, new OVERLAPPED);
    }

    while (!bStop)
    {
        int nVal;
        cout << "Enter 1 to cause current processing threads to end: ";
        cin >> nVal;
        bStop = (nVal == 1);
    }

    // Tell all 10 threads to stop processing on the IOCP.
    for (int i = 0; i < 10; ++i)
    {
        PostQueuedCompletionStatus(hCompletionPort1, 0, 0, nullptr); // Special packet marking end of IOCP usage.
    }
    for_each(begin(vecAllThreads), end(vecAllThreads), mem_fn(&thread::join));
    CloseHandle(hCompletionPort1);
    return 0;
}
EDIT #1
What I mean by "separate thread pool" is something like the following:
class myThread {
public:
    void SetTask(LPOVERLAPPED pO) { /* start processing pO */ }
private:
    thread m_thread; // Actual thread object
};

// The threads in this thread pool are not associated with the IOCP in any way whatsoever; they exist
// purely to be handed a completion packet which they then process!
class ThreadPool
{
public:
    void Initialise() { /* create 100 worker threads and add them to some internal storage */ }
    myThread& GetNextFreeThread() { /* return one of the 100 worker threads we created */ }
} g_threadPool;

The code in each of my four IOCP threads then changes to:

if (::GetQueuedCompletionStatus(hCompletionPort1, &dwBytes, &uKey, &pOverlapped, INFINITE))
{
    if (dwBytes == 0 && uKey == 0 && pOverlapped == nullptr)
        break; // Special completion packet; end processing.

    // Pick a free thread from the pool of pre-created threads and hand it the packet to process.
    myThread& thr = g_threadPool.GetNextFreeThread();
    thr.SetTask(pOverlapped);

    // Now this thread can immediately return to the IOCP; it doesn't matter if the
    // packet we dequeued takes forever to process, because that happens in the
    // separate thread thr *which will not interfere with packets being dequeued from the IOCP!*
}

This way, I can never end up in the situation where no more packets are being dequeued!
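To make the hand-off concrete, here is a minimal sketch of what I mean (illustrative only; WorkerPool, AddTask and the other names are invented for this sketch, and a real pool would need shutdown handling for in-flight packets): the IOCP threads push the dequeued LPOVERLAPPED onto a mutex-protected queue, and a fixed set of workers pops and processes them.

#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>
#include <windows.h>

class WorkerPool
{
public:
    explicit WorkerPool(size_t nThreads)
    {
        for (size_t i = 0; i < nThreads; ++i)
            m_workers.emplace_back([this] { Run(); });
    }
    ~WorkerPool()
    {
        {
            std::lock_guard<std::mutex> lk(m_mtx);
            m_bDone = true;
        }
        m_cv.notify_all();
        for (auto& t : m_workers)
            t.join();
    }
    // Called from the IOCP threads; returns immediately.
    void AddTask(LPOVERLAPPED pO)
    {
        {
            std::lock_guard<std::mutex> lk(m_mtx);
            m_tasks.push_back(pO);
        }
        m_cv.notify_one();
    }
private:
    void Run()
    {
        for (;;)
        {
            LPOVERLAPPED pO = nullptr;
            {
                std::unique_lock<std::mutex> lk(m_mtx);
                m_cv.wait(lk, [this] { return m_bDone || !m_tasks.empty(); });
                if (m_bDone && m_tasks.empty())
                    return;
                pO = m_tasks.front();
                m_tasks.pop_front();
            }
            // Process pO here; if this takes forever it only ties up this one
            // worker, never the threads waiting on GetQueuedCompletionStatus().
        }
    }
    std::mutex m_mtx;
    std::condition_variable m_cv;
    std::deque<LPOVERLAPPED> m_tasks;
    bool m_bDone = false;
    std::vector<std::thread> m_workers;
};

The IOCP thread body then shrinks to a single AddTask(pOverlapped) call followed by an immediate return to GetQueuedCompletionStatus().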

It seems there are conflicting opinions on whether a separate thread pool should be used. Clearly, as the sample code I have posted shows, there is potential for packets to stop being dequeued from the IOCP if the processing of a packet does not enter a wait state; granted, the infinite loop is perhaps unrealistic, but it does demonstrate the point.

Related


socketCAN connection: read() not fast enough
Hello,
I use the socket() connection for my CAN communication.
fd = socket(PF_CAN, SOCK_RAW, CAN_RAW);
I'm using 2 threads: one periodic 1ms RT thread to send data and one
thread to read the incoming messages. The read function looks like:
void readCan0Socket(void){
    int receivedBytes = 0;
    do
    {
        // set GPIO pin low
        receivedBytes = read(fd,
                             &receiveCanFrame[recvBufferWritePosition],
                             sizeof(struct can_frame));
        // reset GPIO pin high
        if (receivedBytes != 0)
        {
            if (receivedBytes == sizeof(struct can_frame))
            {
                recvBufferWritePosition++;
                if (recvBufferWritePosition == CAN_MAX_RECEIVE_BUFFER_LENGTH)
                {
                    recvBufferWritePosition = 0;
                }
            }
            receivedBytes = 0;
        }
    } while (1);
}
The socket is configured in blocking mode, so read() blocks until a message arrives. The current implementation works, but when I measure the time between reading a message and the next wait state of read() (see the set/reset GPIO comments) the time varies between 30 us (the mean value) and more than 200 us. A value greater than 200 us means that frames are missed while read() handles the previous message: CAN runs at 1 Mbit/s (1000 kbit/s), and a full frame occupies roughly 134 bit times, so back-to-back frames can arrive every 134 us. The read() function must therefore be ready again within 134 us.
How can I speed up my implementation? I tried using two reader threads separated by mutexes (lock before the read() call, unlock after a message reception), but this didn't solve my problem.
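One direction that might help - my suggestion, not something from the original post - is to cut the per-frame syscall overhead: on Linux, the generic recvmmsg() call can drain several already-queued frames from a socket in one kernel transition. A rough sketch (BATCH and read_can_batch are invented names):

#define _GNU_SOURCE
#include <linux/can.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH 16

/* Blocks until at least one frame arrives, then also returns any further
 * frames already queued, up to BATCH. Returns the number of frames read. */
int read_can_batch(int fd, struct can_frame frames[BATCH])
{
    struct mmsghdr msgs[BATCH];
    struct iovec iovs[BATCH];
    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; ++i) {
        iovs[i].iov_base = &frames[i];
        iovs[i].iov_len = sizeof(struct can_frame);
        msgs[i].msg_hdr.msg_iov = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    return recvmmsg(fd, msgs, BATCH, MSG_WAITFORONE, NULL);
}

With MSG_WAITFORONE the call blocks only until the first frame, so single-frame latency is unchanged, while a burst of back-to-back frames costs one syscall instead of one per frame.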

Can you choose a thread from a thread pool to execute (boost)

Here is some code I have at the moment.
int main()
{
    boost::thread_group threads; // Thread pool
    // Here we create threads and kick them off by passing
    // the address of the function to call
    for (int i = 0; i < num_threads; i++)
        threads.create_thread(&SendDataToFile);
    threads.join_all();
    system("PAUSE");
}

void SendDataToFile()
{
    // The lock guard will make sure only one thread (client)
    // will access this function at once
    boost::lock_guard<boost::mutex> lock(io_mutex);
    for (int i = 0; i < 5; i++)
        cout << "Writing " << boost::this_thread::get_id() << endl;
}
At the moment I'm just using cout instead of writing to a file.
Is it possible to choose which thread carries out an operation before another? Say I have a file that 4 threads want to access at the same time; can I say "OK, thread 2, you go first" in Boost?
Also, can an fstream be used like cout? When I wrote to a file without a mutex, the output was not messy, but when I print to the console without a mutex it is messy, as you would expect.
There are a number of equivalent ways you could do this using some combination of global variables protected by atomic updates, a mutex, semaphore, condition variable, etc. The way that seems to me to most directly communicate what you're trying to do is to have your threads wait on a ticket lock where instead of their ticket number representing the order that they arrived at the lock, it's chosen to be the order in which the threads were created. You could combine that idea with the Boost spinlock example for a simple and probably performant implementation.
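For illustration, here is a minimal sketch of that idea (my own code, using std::thread and a condition variable rather than the Boost spinlock the answer mentions; the mechanics are the same): each thread is handed a ticket equal to its creation order and waits until the "now serving" counter reaches it, so the threads enter the exclusive section strictly in creation order.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex m;
std::condition_variable cv;
unsigned now_serving = 0; // guarded by m

void SendDataToFile(unsigned ticket)
{
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [ticket] { return now_serving == ticket; });
    // Exclusive section: write to the file here, in creation order.
    std::cout << "Writing from thread with ticket " << ticket << std::endl;
    ++now_serving;
    cv.notify_all(); // wake the thread holding the next ticket
}

int main()
{
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < 4; ++i)       // ticket == creation order
        threads.emplace_back(SendDataToFile, i);
    for (auto& t : threads)
        t.join();
}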

Callback passed to boost::asio::async_read_some never invoked in usage where boost::asio::read_some returns data

I have been working on implementing a half duplex serial driver by learning from a basic serial terminal example using boost::asio::basic_serial_port:
http://lists.boost.org/boost-users/att-41140/minicom.cpp
I need to read asynchronously but still detect in the main thread when the handler has finished, so I pass async_read_some a callback with several additional reference parameters bound in with boost::bind. The handler never gets invoked, but if I replace the async_read_some call with read_some, it returns data without an issue.
I believe I'm satisfying all of the necessary requirements for this function to invoke the handler, because they are the same as for the blocking read_some function, which does return data:
The buffer stays in scope
One or more bytes is received by the serial device
The io service is running
The port is open and running at the correct baud rate
Does anyone know if I'm missing another assumption unique to the asynchronous read or if I'm not setting up the io_service correctly?
Here is an example of how I'm using the code with async_read_some (http://www.boost.org/doc/libs/1_56_0/doc/html/boost_asio/reference/basic_serial_port/async_read_some.html):
void readCallback(const boost::system::error_code& error, size_t bytes_transfered,
                  bool& finished_reading, boost::system::error_code& error_report,
                  size_t& bytes_read)
{
    std::cout << "READ CALLBACK\n";
    std::cout.flush();
    error_report = error;
    bytes_read = bytes_transfered;
    finished_reading = true;
}
int main()
{
    int baud_rate = 115200;
    std::string port_name = "/dev/ttyUSB0";
    boost::asio::io_service io_service_;
    boost::asio::serial_port serial_port_(io_service_, port_name);
    serial_port_.set_option(boost::asio::serial_port_base::baud_rate(baud_rate));
    boost::thread service_thread_;
    service_thread_ = boost::thread(boost::bind(&boost::asio::io_service::run, &io_service_));
    std::cout << "Starting byte read\n";
    boost::system::error_code ec;
    bool finished_reading = false;
    size_t bytes_read = 0;
    const int max_response_size = 8;
    uint8_t read_buffer[max_response_size];
    serial_port_.async_read_some(boost::asio::buffer(read_buffer, max_response_size),
                                 boost::bind(readCallback,
                                             boost::asio::placeholders::error,
                                             boost::asio::placeholders::bytes_transferred,
                                             finished_reading, ec, bytes_read));
    std::cout << "Waiting for read to finish\n";
    while (!finished_reading)
    {
        boost::this_thread::sleep(boost::posix_time::milliseconds(1));
    }
    std::cout << "Finished byte read: " << bytes_read << "\n";
    for (size_t i = 0; i < bytes_read; ++i)
    {
        printf("0x%x ", read_buffer[i]);
    }
}
The result is that the callback never prints anything and the while (!finished_reading) loop never exits.
Here is how I use the blocking read_some function (boost.org/doc/libs/1_56_0/doc/html/boost_asio/reference/basic_serial_port/read_some.html):
int main()
{
    int baud_rate = 115200;
    std::string port_name = "/dev/ttyUSB0";
    boost::asio::io_service io_service_;
    boost::asio::serial_port serial_port_(io_service_, port_name);
    serial_port_.set_option(boost::asio::serial_port_base::baud_rate(baud_rate));
    boost::thread service_thread_;
    service_thread_ = boost::thread(boost::bind(&boost::asio::io_service::run, &io_service_));
    std::cout << "Starting byte read\n";
    boost::system::error_code ec;
    const int max_response_size = 8;
    uint8_t read_buffer[max_response_size];
    int bytes_read = serial_port_.read_some(boost::asio::buffer(read_buffer, max_response_size), ec);
    std::cout << "Finished byte read: " << bytes_read << "\n";
    for (int i = 0; i < bytes_read; ++i)
    {
        printf("0x%x ", read_buffer[i]);
    }
}
This version prints from 1 up to 8 characters that I send, blocking until at least one is sent.
The code does not guarantee that the io_service is running. io_service::run() will return when either:
All work has finished and there are no more handlers to be dispatched
The io_service has been stopped.
In this case, it is possible for the service_thread_ to be created and invoke io_service::run() before the serial_port::async_read_some() operation is initiated, adding work to the io_service. Thus, the service_thread_ could immediately return from io_service::run(). To resolve this, either:
Invoke io_service::run() after the asynchronous operation has been initiated.
Create an io_service::work object before starting the service_thread_. A work object prevents the io_service from running out of work.
This answer may provide some more insight into the behavior of io_service::run().
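For example, a minimal sketch of the second option using the question's names (io_service::work is the pre-Boost-1.66 idiom; newer Asio uses executor_work_guard):

boost::asio::io_service io_service_;
// The work object keeps run() from returning while no handlers are pending.
boost::asio::io_service::work work(io_service_);
boost::thread service_thread_(boost::bind(&boost::asio::io_service::run, &io_service_));
// run() now blocks until io_service_.stop() is called (or until the work
// object is destroyed and all outstanding handlers have completed), so it no
// longer matters whether async_read_some() is initiated before or after the
// thread starts.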
A few other things to note and to expand upon Igor's answer:
If a thread is not progressing in a meaningful way while waiting for an asynchronous operation to complete (i.e. spinning in a loop sleeping), then it may be worth examining if mixing synchronous behavior with asynchronous operations is the correct solution.
boost::bind() copies its arguments by value. To pass an argument by reference, wrap it with boost::ref() or boost::cref():
boost::bind(..., boost::ref(finished_reading), boost::ref(ec),
boost::ref(bytes_read));
Synchronization needs to be added to guarantee memory visibility of finished_reading in the main thread. For asynchronous operations, Boost.Asio will guarantee the appropriate memory barriers to ensure correct memory visibility (see this answer for more details). In this case, a memory barrier is required within the main thread to guarantee the main thread observes changes to finished_reading by other threads. Consider using either a Boost.Thread synchronization mechanism like boost::mutex, or Boost.Atomic's atomic objects or thread and signal fences.
Note that boost::bind copies its arguments. If you want to pass an argument by reference, wrap it with boost::ref (or std::ref):
boost::bind(readCallback, boost::asio::placeholders::error, boost::asio::placeholders::bytes_transferred, boost::ref(finished_reading), boost::ref(ec), boost::ref(bytes_read)));
(However, strictly speaking, there's a race condition on the bool variable you pass to another thread. A better solution would be to use std::atomic_bool.)
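As a minimal illustration of that last point, independent of Asio (a standalone sketch, not the poster's code): with std::atomic<bool>, the store in the worker is guaranteed to become visible to the polling loop in the main thread, whereas with a plain bool the behavior is undefined.

#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> finished_reading(false);

int main()
{
    // Stand-in for the completion handler running on the io_service thread.
    std::thread worker([] { finished_reading = true; });
    while (!finished_reading)   // atomic load: guaranteed to observe the store
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    worker.join();
}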

MPI Non-blocking Irecv didn't receive data?

I use MPI non-blocking communication (MPI_Irecv, MPI_Isend) to monitor the slaves' idle states; the code is like below.
rank 0:
int dest = -1;
while (dest <= 0) {
    int i;
    for (i = 1; i <= slaves_num; i++) {
        printf("slave %d, now is %d \n", i, idle_node[i]);
        if (idle_node[i] == 1) {
            idle_node[i] = 0;
            dest = i;
            break;
        }
    }
    if (dest <= 0) {
        MPI_Irecv(&idle_node[1], 1, MPI_INT, 1, MSG_IDLE, MPI_COMM_WORLD, &request);
        MPI_Irecv(&idle_node[2], 1, MPI_INT, 2, MSG_IDLE, MPI_COMM_WORLD, &request);
        MPI_Irecv(&idle_node[3], 1, MPI_INT, 3, MSG_IDLE, MPI_COMM_WORLD, &request);
        // MPI_Wait(&request, &status);
    }
    usleep(100000);
}
idle_node[dest] = 0; // indicates this slave is busy now
rank 1,2,3:
while (1)
{
    ... // do something
    MPI_Isend(&idle, 1, MPI_INT, 0, MSG_IDLE, MPI_COMM_WORLD, &request);
    MPI_Wait(&request, &status);
}
it works, but I want it to be faster, so I delete the line:
usleep(100000);
then rank 0 spins in an endless loop like this:
slave 1, now is 0
slave 2, now is 0
slave 3, now is 0
slave 1, now is 0
slave 2, now is 0
slave 3, now is 0
...
So does this indicate that when I use MPI_Irecv, it just tells MPI that I want to receive a message here (without actually having received it yet), and MPI needs more time to receive the real data? Or is there some other reason?
The use of non-blocking operations has been discussed over and over again here. From the MPI specification (section Nonblocking Communication):
Similarly, a nonblocking receive start call initiates the receive operation, but does not complete it. The call can return before a message is stored into the receive buffer. A separate receive complete call is needed to complete the receive operation and verify that the data has been received into the receive buffer. With suitable hardware, the transfer of data into the receiver memory may proceed concurrently with computations done after the receive was initiated and before it completed.
(the bold text is copied verbatim from the standard; the emphasis in italic is mine)
The key sentence is the last one. The standard does not give any guarantee that a non-blocking receive operation will ever complete (or even start) unless MPI_WAIT[ALL|SOME|ANY] or MPI_TEST[ALL|SOME|ANY] was called (with MPI_TEST* setting a value of true for the completion flag).
By default Open MPI comes as a single-threaded library and without special hardware acceleration the only way to progress non-blocking operations is to either call periodically into some non-blocking calls (with the primary example of MPI_TEST*) or call into a blocking one (with the primary example being MPI_WAIT*).
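For example, the receiver could drive progress itself by polling the request (a sketch using the names from the question):

int flag = 0;
MPI_Irecv(&idle_node[1], 1, MPI_INT, 1, MSG_IDLE, MPI_COMM_WORLD, &request);
do {
    // Each call to MPI_Test gives the MPI library a chance to progress the
    // receive; once the receive completes, it also frees the request and
    // sets it to MPI_REQUEST_NULL.
    MPI_Test(&request, &flag, MPI_STATUS_IGNORE);
    // ... do other useful work here instead of busy-waiting ...
} while (!flag);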
Also your code leads to a nasty leak that will sooner or later result in resource exhaustion: you are calling MPI_Irecv multiple times with the same request variable, effectively overwriting its value and losing the reference to the previously started requests. Requests that are not waited upon are never freed and therefore remain in memory.
There is absolutely no need to use non-blocking operations in your case. If I understand the logic correctly, you can achieve what you want with code as simple as:
MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, MSG_IDLE, MPI_COMM_WORLD, &status);
idle_node[status.MPI_SOURCE] = 0;
If you'd like to process more than one worker process at the same time, it is a bit more involved:
MPI_Request reqs[slaves_num];
int indices[slaves_num], num_completed;

for (i = 0; i < slaves_num; i++)
    reqs[i] = MPI_REQUEST_NULL;

while (1)
{
    // Repost all completed (or never started) receives
    for (i = 1; i <= slaves_num; i++)
        if (reqs[i-1] == MPI_REQUEST_NULL)
            MPI_Irecv(&idle_node[i], 1, MPI_INT, i, MSG_IDLE,
                      MPI_COMM_WORLD, &reqs[i-1]);

    MPI_Waitsome(slaves_num, reqs, &num_completed, indices, MPI_STATUSES_IGNORE);

    // Examine num_completed and indices and feed the workers with data
    ...
}
After the call to MPI_Waitsome there will be one or more completed requests. The exact number will be in num_completed and the indices of the completed requests will be filled in the first num_completed elements of indices[]. The completed requests will be freed and the corresponding elements of reqs[] will be set to MPI_REQUEST_NULL.
Also, there appears to be a common misconception about using non-blocking operations. A non-blocking send can be matched by a blocking receive and also a blocking send can be equally matched by a non-blocking receive. That makes such constructs nonsensical:
// Receiver
MPI_Irecv(..., &request);
... do something ...
MPI_Wait(&request, &status);
// Sender
MPI_Isend(..., &request);
MPI_Wait(&request, MPI_STATUS_IGNORE);
MPI_Isend immediately followed by MPI_Wait is equivalent to MPI_Send and the following code is perfectly valid (and easier to understand):
// Receiver
MPI_Irecv(..., &request);
... do something ...
MPI_Wait(&request, &status);
// Sender
MPI_Send(...);

Multithreaded synchronization primitive

I have the following scenario:
I have multiple worker threads running that all go through a certain section of code, and they're allowed to do so simultaneously. No critical section surrounds this piece of code right now as it's not required for these threads.
I have a main thread that also - occasionally - wants to enter that section of code, but when it does, none of the other worker threads should be using it.
Naive solution: surround the section of code with a critical section. But that would kill a lot of parallelism between the worker threads, which is important in my case.
Is there a better solution?
Use RW locks. RW locks allow multiple readers and only a single writer. Your workers would take the read lock at the start of the critical section, and the main thread would take the write lock.
By definition, when acquiring the read lock, the calling thread waits for any writing threads to finish. When acquiring the write lock, the calling thread waits for any reading or writing threads to finish.
Example using POSIX threads:
#include <pthread.h>
#include <unistd.h>

pthread_rwlock_t lock;

/* worker threads */
void *do_work(void *args) {
    for (int i = 0; i < 100; ++i) {
        pthread_rwlock_rdlock(&lock);
        // do some work...
        pthread_rwlock_unlock(&lock);
        sleep(1);
    }
    pthread_exit(0);
}

/* main thread */
int main(void) {
    pthread_t workers[4];
    pthread_rwlock_init(&lock, NULL);
    int i;
    // spawn workers...
    for (i = 0; i < 4; ++i) {
        pthread_create(&workers[i], NULL, do_work, NULL);
    }
    for (i = 0; i < 100; ++i) {
        pthread_rwlock_wrlock(&lock);
        // do some work...
        pthread_rwlock_unlock(&lock);
        sleep(1);
    }
    // wait for the workers and clean up
    for (i = 0; i < 4; ++i) {
        pthread_join(workers[i], NULL);
    }
    pthread_rwlock_destroy(&lock);
    return 0;
}
As far as I understand it, your worker threads are started asynchronously. So when the main thread wants to run this code section, you have to ensure that no worker thread is executing it. Therefore you have to stop all worker threads before the main thread can enter that code section, and allow them to enter it again afterwards.
This could be done - using Grand Central Dispatch - if your worker threads would be assigned to a dispatch group, see https://developer.apple.com/library/mac/#documentation/Performance/Reference/GCD_libdispatch_Ref/Reference/reference.html.
The main thread could then call dispatch_group_wait on this dispatch group, wait for all worker threads to leave this code section, execute it, and then requeue the worker threads; see the sketch below.
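A rough sketch of that approach (my own construction, with invented names such as worker_pass; a real version would carry the workers' actual loop logic): each pass through the shared section is submitted to a dispatch group, and dispatch_group_wait() serves as the main thread's barrier before its exclusive section.

#include <dispatch/dispatch.h>

static void worker_pass(void *ctx)
{
    // One pass through the shared code section, run concurrently by workers.
}

int main(void)
{
    dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_group_t group = dispatch_group_create();

    for (int pass = 0; pass < 3; ++pass) {
        // Queue a pass for each of the four workers as part of the group.
        for (int i = 0; i < 4; ++i)
            dispatch_group_async_f(group, queue, NULL, worker_pass);

        // Blocks until every queued worker item has finished, i.e. until no
        // worker is inside the shared section any more.
        dispatch_group_wait(group, DISPATCH_TIME_FOREVER);

        // The main thread's exclusive section runs here; the next loop
        // iteration then requeues the workers.
    }
    dispatch_release(group);
    return 0;
}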
