When using an IOCP in a job/task pool to provide fast worker wake-ups, what is the best way to minimise the overhead of signalling the port, i.e. not having to do it on every queue operation?
void Worker()
{
    while(1)
    {
        // Spin for a while, draining the queue, before falling back to a blocking wait.
        for(int spin = 0; spin < 5000; ++spin)
            while(queue.Count > 0)
                queue.PopFront()();   // pop a work item and execute it
        WaitOnCompletionPort();
    }
}
...
queue.PushBack(someWork);
// decide when to signal completion port but avoid doing it every queue operation ?
For example, in the rough code sketch above there is a race between work being queued and the wait being entered if you try to avoid signalling the port on every queue operation.
Why don't you use the IOCP as your queue and post your work items directly to it? That way you get a thread-safe queue for free and can completely remove the other queue you have.
This question would then go away ;)
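For illustration, here is a minimal sketch of that approach: each work item is posted to the port with PostQueuedCompletionStatus, and GetQueuedCompletionStatus acts as both the dequeue and the wake-up. The WorkItem struct and the choice to carry the pointer in the completion key are assumptions made for this example, not part of the original code.

#include <windows.h>

struct WorkItem { void (*fn)(void *); void *arg; };

// Producer: enqueue a work item by posting it as a completion packet.
void Push(HANDLE iocp, struct WorkItem *item)
{
    PostQueuedCompletionStatus(iocp, 0, (ULONG_PTR)item, NULL);
}

// Worker: blocking on the port is the dequeue, so there is no separate
// "signal the port" step left to optimise away.
DWORD WINAPI Worker(LPVOID param)
{
    HANDLE iocp = (HANDLE)param;
    DWORD bytes;
    ULONG_PTR key;
    LPOVERLAPPED ov;
    while (GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE))
    {
        struct WorkItem *item = (struct WorkItem *)key;
        item->fn(item->arg);
    }
    return 0;
}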
Is there a special "wait for event" function that can wait for three queues at the same time on the device side, so that it doesn't wait for all queues serially from the host side?
Is there a checkpoint command to send into a command queue such that it must wait for the other command queues to hit the same (vertical) barrier/checkpoint before continuing, handled on the device side so that no host-side round trip is needed?
For now, I tried two different versions:
clWaitForEvents(3, evt_);
and
// poll each event until its execution status drops to CL_COMPLETE (or a negative error code)
for (int i = 0; i < 3; ++i)
{
    int evtStatus = 0;
    clGetEventInfo(evt_[i], CL_EVENT_COMMAND_EXECUTION_STATUS,
                   sizeof(cl_int), &evtStatus, NULL);
    while (evtStatus > 0)
    {
        clGetEventInfo(evt_[i], CL_EVENT_COMMAND_EXECUTION_STATUS,
                       sizeof(cl_int), &evtStatus, NULL);
        Sleep(0);
    }
}
The second one is a bit faster (I saw it from someone else), and both are executed after three flush commands.
Looking at the CodeXL profiler results, the first one waits longer between finish points and some operations don't even seem to overlap. The second one shows that all three finish points are within 3 milliseconds, so it is faster and the longer parts are overlapped (read + write + compute at the same time).
If there is a way to achieve this with only one wait command from the host side, there must be a "flush" version of it too, but I couldn't find it.
Is there any way to achieve the picture below instead of adding flushes between each pipeline step?
queue1 write checkpoint write checkpoint write
queue2 - compute checkpoint compute checkpoint compute
queue3 - checkpoint read checkpoint read
All checkpoints have to be vertically synchronized, and none of these actions may start until a signal is given. Something like:
queue1.ndwrite(...);
queue1.ndcheckpoint(...);
queue1.ndwrite(...);
queue1.ndcheckpoint(...);
queue1.ndwrite(...);
queue2.ndrangekernel(...);
queue2.ndcheckpoint(...);
queue2.ndrangekernel(...);
queue2.ndcheckpoint(...);
queue2.ndrangekernel(...);
queue3.ndread(...);
queue3.ndcheckpoint(...);
queue3.ndread(...);
queue3.ndcheckpoint(...);
queue3.ndread(...);
queue1.flush()
queue2.flush()
queue3.flush()
queue1.finish()
queue2.finish()
queue3.finish()
Checkpoints are all handled on the device side, and only three finish commands are needed from the host side (even better, only one finish for all queues?).
How I currently bind the three queues to three events for "clWaitForEvents(3, evt_);" is:
hCommandQueue->commandQueue.enqueueBarrierWithWaitList(NULL, &evt[0]);
hCommandQueue2->commandQueue.enqueueBarrierWithWaitList(NULL, &evt[1]);
hCommandQueue3->commandQueue.enqueueBarrierWithWaitList(NULL, &evt[2]);
if this "enqueue barrier" can talk with other queues, how could I achieve that? Do I need to keep host-side events alive until all queues are finished or can I delete them or re-use them later? From the documentation, it seems like first barrier's event can be put to second queue and second one's barrier event can be put to third one along with first one's event so maybe it is like:
hCommandQueue->commandQueue.enqueueBarrierWithWaitList(NULL, &evt[0]);
hCommandQueue2->commandQueue.enqueueBarrierWithWaitList(evt_0, &evt[1]);
hCommandQueue3->commandQueue.enqueueBarrierWithWaitList(evt_0_and_1, &evt[2]);
and in the end wait only for evt[2], or maybe use one and the same event for all:
hCommandQueue->commandQueue.enqueueBarrierWithWaitList(sameEvt, &evt[0]);
hCommandQueue2->commandQueue.enqueueBarrierWithWaitList(sameEvt, &evt[1]);
hCommandQueue3->commandQueue.enqueueBarrierWithWaitList(sameEvt, &evt[2]);
Where do I get the sameEvt object?
Has anyone tried this? Should I start all queues with a barrier so they don't start until I raise some event from the host side, or is the lazy execution of "enqueue" 100% trustworthy not to start anything until I flush/finish the queues? How do I raise an event from host to device (sameEvt doesn't have a "raise" function, is it clCreateUserEvent)?
All three queues are in-order and in the same context. Out-of-order queues are not supported by all graphics cards. The C++ bindings are being used.
There are also enqueueWaitList (is this deprecated?) and clEnqueueMarker, but I don't know how to use them, and the documentation on Khronos' website doesn't have any examples.
You asked too many questions and described too many variants for there to be a single answer, so I will try to answer in general terms so that you can work out the most suitable solution yourself.
If the queues are bound to the same context (possibly to different devices within the same context), then it is possible to synchronize them through events. That is, you can obtain an event from a command submitted to one queue and use this event to make a command submitted to another queue wait for it, e.g.
queue1.enqueue(comm1, /*dependency*/ NULL, /*result event*/ &e1);
queue2.enqueue(comm2, /*dependency*/ &e1, /*result event*/ NULL);
In this example, comm2 will wait for comm1 completion.
If you need to enqueue commands first but not allow them to execute yet, you can create a user event (clCreateUserEvent) and signal it manually (clSetUserEventStatus). The implementation is allowed to process commands as soon as they are enqueued (the driver is not required to wait for a flush).
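As a minimal sketch of that idea with the C++ bindings you are already using (the buffer, size and pointer names here are placeholders, not from your code), a user event can act as the host-controlled "start" signal:

cl::UserEvent startSignal(context);                     // a candidate for "sameEvt"
std::vector<cl::Event> gate { startSignal };
cl::Event firstWriteDone;
queue1.enqueueWriteBuffer(buf, CL_FALSE, 0, size, ptr, &gate, &firstWriteDone);
queue1.flush();                                         // submitted, but held back by the gate
startSignal.setStatus(CL_COMPLETE);                     // host-side "raise"; the write may now run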
A barrier seems overkill for your purpose because it waits for all commands previously submitted to the queue. You can instead use a marker (clEnqueueMarkerWithWaitList in OpenCL 1.2), which waits for a list of events and provides one event that other commands can wait on.
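A small sketch with the C++ bindings (this assumes the OpenCL 1.2 enqueueMarkerWithWaitList entry point is available; the event names are placeholders):

std::vector<cl::Event> deps { evtFromQueue1, evtFromQueue2 };   // events taken from other queues
cl::Event markerDone;
queue3.enqueueMarkerWithWaitList(&deps, &markerDone);           // completes once deps are complete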
As far as I know, you can release an event at any moment once you no longer need it. The implementation should prolong the event's lifetime if it is required for internal purposes.
I do not know what enqueueWaitList is.
Off-topic: if you need non-trivial dependencies between calculations, you may want to consider the TBB flow graph and opencl_node. The opencl_node uses events for synchronization and avoids host-device synchronizations where possible. However, it can be tricky to use multiple queues for the same device.
As far as I know, Intel HD Graphics 530 supports out-of-order queues (at least host-side).
You are making it much harder than it needs to be. On the write queue take an event. Use that as a condition for the compute on the compute queue, and take another event. Use that as a condition on the read on the read queue. There is no reason to force any other synchronization. Note: My interpretation of the spec is that you must clFlush on a queue that you took an event from before using that event as a condition on another queue.
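For illustration, a hedged sketch of that chain with the C++ bindings used in the question; the buffer, kernel and size names are placeholders. The write yields an event, the compute waits on it and yields another, the read waits on that, and each queue is flushed after its event is taken, per the note above.

cl::Event writeDone, computeDone, readDone;

queue1.enqueueWriteBuffer(input, CL_FALSE, 0, size, hostIn, NULL, &writeDone);
queue1.flush();                                    // flush before using writeDone on another queue

std::vector<cl::Event> afterWrite { writeDone };
queue2.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(count), cl::NullRange,
                            &afterWrite, &computeDone);
queue2.flush();

std::vector<cl::Event> afterCompute { computeDone };
queue3.enqueueReadBuffer(output, CL_FALSE, 0, size, hostOut, &afterCompute, &readDone);
queue3.flush();

readDone.wait();                                   // a single host-side wait for the whole chain

Only the final event needs a host-side wait; the intermediate dependencies are handled by the OpenCL runtime rather than by host-side round trips.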
Suppose I have two Linux kernel threads, a master thread and a worker thread. The master uses kthread_run() to create the worker. While the worker is blocked accepting a socket connection, the master calls kthread_stop() to stop it.
Because the worker is blocked in the accept operation and cannot exit, the kthread_stop() call in the master will not return.
What should I do to kill the worker thread from the master gracefully? Thanks.
You need to specify a timeout for the blocking socket the worker is reading from:
struct timeval tv;
tv.tv_sec = 0;
tv.tv_usec = 100000; /* 100 ms timeout */
kernel_setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, (char *)&tv, sizeof(struct timeval));
Whenever the recv call returns, you check kthread_should_stop(). With a 100 ms timeout there should be almost zero polling overhead.
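A minimal sketch of what that worker loop might look like, assuming the socket was created in kernel space and the receive timeout above has already been applied; worker_fn and the buffer size are made up for the example.

static int worker_fn(void *data)
{
    struct socket *sock = data;
    struct msghdr msg = {};
    struct kvec vec;
    char buf[128];
    int len;

    while (!kthread_should_stop()) {
        vec.iov_base = buf;
        vec.iov_len = sizeof(buf);
        len = kernel_recvmsg(sock, &msg, &vec, 1, sizeof(buf), 0);
        if (len == -EAGAIN || len == -EINTR)
            continue;        /* receive timed out; re-check the stop flag */
        if (len <= 0)
            break;           /* peer closed the connection or a fatal error */
        /* ... process the received data ... */
    }
    return 0;
}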
While reading the ZeroMQ guide, I came across client code which sends 100k requests in a loop and then receives the replies in a second loop.
#include "../include/mdp.h"
#include <time.h>
int main (int argc, char *argv [])
{
int verbose = (argc > 1 && streq (argv [1], "-v"));
mdp_client_t *session = mdp_client_new ("tcp://localhost:5555", verbose);
int count;
for (count = 0; count < 100000; count++) {
zmsg_t *request = zmsg_new ();
zmsg_pushstr (request, "Hello world");
mdp_client_send (session, "echo", &request);
}
printf("sent all\n");
for (count = 0; count < 100000; count++) {
zmsg_t *reply = mdp_client_recv (session,NULL,NULL);
if (reply)
zmsg_destroy (&reply);
else
break; // Interrupted by Ctrl-C
printf("reply received:%d\n", count);
}
printf ("%d replies received\n", count);
mdp_client_destroy (&session);
return 0;
}
I have added a counter to count the number of replies that the worker (test_worker.c) sends to the broker, and another counter in mdp_broker.c to count the number of replies the broker sends to a client. Both of these count up to 100k, but the client is receiving only around 37k replies.
If the number of client requests is set to around 40k, then it receives all the replies. Can someone please tell me why packets are lost when the client sends more than 40k asynchronous requests?
I tried setting the HWM to 100k for the broker socket, but the problem persists:
static broker_t *
s_broker_new (int verbose)
{
broker_t *self = (broker_t *) zmalloc (sizeof (broker_t));
int64_t hwm = 100000;
// Initialize broker state
self->ctx = zctx_new ();
self->socket = zsocket_new (self->ctx, ZMQ_ROUTER);
zmq_setsockopt(self->socket, ZMQ_SNDHWM, &hwm, sizeof(hwm));
zmq_setsockopt(self->socket, ZMQ_RCVHWM, &hwm, sizeof(hwm));
self->verbose = verbose;
self->services = zhash_new ();
self->workers = zhash_new ();
self->waiting = zlist_new ();
self->heartbeat_at = zclock_time () + HEARTBEAT_INTERVAL;
return self;
}
Without setting the HWM and using the default TCP settings, packet loss was being incurred with just 50k messages.
The following helped to mitigate the packet loss at the broker:
Setting the HWM for the zeromq socket.
Increasing the TCP send/receive buffer size.
This helped only up to a certain point. With two clients, each sending 100k messages, the broker was able to manage fine. But when the number of clients was increased to three, they stopped receiving all the replies.
Finally, what has helped me to ensure no packet loss is to change the design of the client code in the following way:
A client can send up to N messages at once. The client's RCVHWM and the broker's SNDHWM should be sufficiently high to hold a total of N messages.
After that, for every reply received by the client, it sends two requests (see the sketch below).
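Here is a rough sketch of that flow-controlled client, reusing the mdp_client calls from the code above; N is a hypothetical window size that must fit within the configured HWMs.

#define N 1000

int sent = 0, received = 0, total = 100000;

/* Prime the pipeline with up to N outstanding requests. */
while (sent < N && sent < total) {
    zmsg_t *request = zmsg_new ();
    zmsg_pushstr (request, "Hello world");
    mdp_client_send (session, "echo", &request);
    sent++;
}
/* Then, for every reply received, send up to two more requests. */
while (received < total) {
    zmsg_t *reply = mdp_client_recv (session, NULL, NULL);
    if (!reply)
        break;                          /* interrupted */
    zmsg_destroy (&reply);
    received++;
    for (int i = 0; i < 2 && sent < total; i++) {
        zmsg_t *request = zmsg_new ();
        zmsg_pushstr (request, "Hello world");
        mdp_client_send (session, "echo", &request);
        sent++;
    }
}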
You send 100k messages and then begin to receive them. Thus, the 100k messages have to be stored in a buffer. When the buffer is exhausted and cannot store any more messages, you hit ZeroMQ's high water mark. The behaviour at the high water mark is specified in the ZeroMQ documentation.
In the case of the above code, the broker may discard some of the messages, since a Majordomo broker uses a ROUTER socket. One resolution would be to split the send/receive loops into separate threads.
Why lost?
In ZeroMQ v2.1, the default value for ZMQ_HWM was INF (infinity), which helped the said test to be somewhat meaningful, but at the cost of a heavy risk of memory-overflow crashes, as the buffer allocation policy was not constrained or controlled and could grow until it hit some physical limit.
As of ZeroMQ v3.0+, ZMQ_SNDHWM / ZMQ_RCVHWM default to 1000, and can be set to other values afterwards.
You may also read an explicit warning that:
ØMQ does not guarantee that the socket will accept as many as ZMQ_SNDHWM messages, and the actual limit may be as much as 60-70% lower depending on the flow of messages on the socket.
Will splitting the sending / receiving part into separate threads help?
No.
Quick fix?
Yes. For the purpose of demo-test experimenting, set infinite high-water marks again, but be careful to avoid such practice in any production-grade software.
Why test ZeroMQ performance in this way?
As said above, the original demo-test seems to have had some meaning in its v2.1 implementation.
Since those days, ZeroMQ has evolved a lot. A very nice read for your particular interest in performance envelopes, which may help build your further insight into this domain, is the step-by-step guide with code examples in the ZeroMQ Guide's case study on protocol overheads/performance for large file transfers:
... we already run into a problem: if we send too much data to the ROUTER socket, we can easily overflow it. The simple but stupid solution is to put an infinite high-water mark on the socket. It's stupid because we now have no protection against exhausting the server's memory. Yet without an infinite HWM, we risk losing chunks of large files.
Try this: set the HWM to 1,000 (in ZeroMQ v3.x this is the default) and then reduce the chunk size to 100K so we send 10K chunks in one go. Run the test, and you'll see it never finishes. As the zmq_socket() man page says with cheerful brutality, for the ROUTER socket: "ZMQ_HWM option action: Drop".
We have to control the amount of data the server sends up-front. There's no point in it sending more than the network can handle. Let's try sending one chunk at a time. In this version of the protocol, the client will explicitly say, "Give me chunk N", and the server will fetch that specific chunk from disk and send it.
The best part, as far as I know, is the commented progression of the resulting performance up to the "model 3" flow control; one can learn a lot from the great chapters and real-life remarks in the ZeroMQ Guide.
The problem:
To design an efficient and very fast named-pipes client/server framework.
Current state:
I already have a battle-proven, production-tested framework. It is fast; however, it uses one thread per pipe connection, and if there are many clients the number of threads can quickly become too high. I already use a smart thread pool (a task pool, in fact) that can scale with demand.
I already use OVERLAPPED mode for the pipes, but then I block with WaitForSingleObject or WaitForMultipleObjects, which is why I need one thread per connection on the server side.
Desired solution:
The client is fine as it is, but on the server side I would like to use one thread per client request rather than per connection. So instead of using one thread for the whole lifecycle of a client (connect/disconnect), I would use one thread per task, i.e. only while the client is requesting data and no longer.
I saw an example on MSDN that uses an array of OVERLAPPED structures and then WaitForMultipleObjects to wait on them all. I find this a bad design. I see two problems here: first, you have to maintain an array that can grow quite large, and deletions will be costly; second, you have a lot of events, one for each array member.
I also saw completion ports, like CreateIoCompletionPort and GetQueuedCompletionStatus, but I don't see how they are any better.
What I would like is something like what ReadFileEx and WriteFileEx do: they call a callback routine when the operation is completed. This is a truly asynchronous style of programming. But the problem is that ConnectNamedPipe does not support that, and furthermore I saw that the thread needs to be in an alertable state and you need to call some of the *Ex functions for that.
So how is such a problem best solved?
Here is how MSDN does it: http://msdn.microsoft.com/en-us/library/windows/desktop/aa365603(v=vs.85).aspx
The problem I see with this approach is that I can't see how you could have 100 clients connected at once if the limit for WaitForMultipleObjects is 64 handles. Sure, I can disconnect the pipe after each request, but the idea is to have a permanent client connection, just like in a TCP server, and to track the client through its whole life-cycle, with each client having a unique ID and client-specific data.
The ideal pseudo code should be like this:
repeat
// wait for the connection or for one client to send data
Result = ConnectNamedPipe or ReadFile or Disconnect;
case Result of
CONNECTED: CreateNewClient; // we create a new client
DATA: AssignWorkerThread; // here we process client request in a thread
DISCONNECT: CleanupAndDeleteClient // release the client object and data
end;
until Aborted;
This way we have only one listener thread that accepts connect/disconnect/onData events. The thread pool (worker threads) only processes the actual requests. This way, five worker threads can serve a large number of connected clients.
P.S.
My current code should not be important. I code this in Delphi, but it's pure WinAPI, so the language does not matter.
EDIT:
For now, IOCP looks like the solution:
I/O completion ports provide an efficient threading model for processing multiple asynchronous I/O requests on a multiprocessor system. When a process creates an I/O completion port, the system creates an associated queue object for requests whose sole purpose is to service these requests. Processes that handle many concurrent asynchronous I/O requests can do so more quickly and efficiently by using I/O completion ports in conjunction with a pre-allocated thread pool than by creating threads at the time they receive an I/O request.
If the server must handle more than 64 events (reads/writes), then any solution using WaitForMultipleObjects becomes unfeasible. This is the reason Microsoft introduced I/O completion ports to Windows. They can handle a very high number of I/O operations using the most appropriate number of threads (usually the number of processors/cores).
The problem with IOCP is that it is very difficult to implement correctly. Hidden issues are spread like mines in a field: [1], [2] (section 3.6). I would recommend using some framework. A little googling suggests something called Indy for Delphi developers. There may be others.
At this point I would disregard the requirement for named pipes if that means coding my own IOCP implementation. It's not worth the grief.
I think what you're overlooking is that you only need a few listening named pipe instances at any given time. Once a pipe instance has connected, you can spin that instance off and create a new listening instance to replace it.
With MAXIMUM_WAIT_OBJECTS (or fewer) listening named pipe instances, you can have a single thread dedicated to listening using WaitForMultipleObjectsEx. The same thread can also handle the rest of the I/O using ReadFileEx and WriteFileEx and APCs. The worker threads would queue APCs to the I/O thread in order to initiate I/O, and the I/O thread can use the task pool to return the results (as well as letting the worker threads know about new connections).
The I/O thread main function would look something like this:
create_events();
for (index = 0; index < MAXIMUM_WAIT_OBJECTS; index++) new_pipe(index);
for (;;)
{
if (service_stopping && active_instances == 0) break;
result = WaitForMultipleObjectsEx(MAXIMUM_WAIT_OBJECTS, connect_events,
FALSE, INFINITE, TRUE);
if (result == WAIT_IO_COMPLETION)
{
continue;
}
else if (result >= WAIT_OBJECT_0 &&
result < WAIT_OBJECT_0 + MAXIMUM_WAIT_OBJECTS)
{
index = result - WAIT_OBJECT_0;
ResetEvent(connect_events[index]);
if (GetOverlappedResult(
connect_handles[index], &connect_overlapped[index],
&byte_count, FALSE))
{
err = ERROR_SUCCESS;
}
else
{
err = GetLastError();
}
connect_pipe_completion(index, err);
continue;
}
else
{
fail();
}
}
The only real complication is that when you call ConnectNamedPipe it may return ERROR_PIPE_CONNECTED to indicate that the call succeeded immediately, or an error other than ERROR_IO_PENDING if the call failed immediately. In either case you need to reset the event and then handle the connection:
void new_pipe(ULONG_PTR dwParam)
{
DWORD index = dwParam;
connect_handles[index] = CreateNamedPipe(
pipe_name,
PIPE_ACCESS_DUPLEX | FILE_FLAG_OVERLAPPED,
PIPE_TYPE_MESSAGE | PIPE_WAIT | PIPE_ACCEPT_REMOTE_CLIENTS,
MAX_INSTANCES,
512,
512,
0,
NULL);
if (connect_handles[index] == INVALID_HANDLE_VALUE) fail();
ZeroMemory(&connect_overlapped[index], sizeof(OVERLAPPED));
connect_overlapped[index].hEvent = connect_events[index];
if (ConnectNamedPipe(connect_handles[index], &connect_overlapped[index]))
{
err = ERROR_SUCCESS;
}
else
{
err = GetLastError();
if (err == ERROR_SUCCESS) err = ERROR_INVALID_FUNCTION;
if (err == ERROR_PIPE_CONNECTED) err = ERROR_SUCCESS;
}
if (err != ERROR_IO_PENDING)
{
ResetEvent(connect_events[index]);
connect_pipe_completion(index, err);
}
}
The connect_pipe_completion function would create a new task in the task pool to handle the newly connected pipe instance, and then queue an APC to call new_pipe to create a new listening pipe at the same index.
It is possible to reuse existing pipe instances once they are closed but in this situation I don't think it's worth the hassle.
I have the classic IOCP callback that dequeues pending I/O requests, processes them, and deallocates them, in this way:
struct MyIoRequest { OVERLAPPED o; /* ... other params ... */ };
bool is_iocp_active = true;
DWORD WINAPI WorkerProc(LPVOID lpParam)
{
ULONG_PTR dwKey;
DWORD dwTrans;
LPOVERLAPPED io_req;
while(is_iocp_active)
{
GetQueuedCompletionStatus((HANDLE)lpParam, &dwTrans, &dwKey, (LPOVERLAPPED*)&io_req, INFINITE);
// NOTE: I could use GetQueuedCompletionStatusEx() here ^ with its alertable
// parameter set to TRUE, so I can wake up the thread with an APC request from another thread!
printf("dequeued an i/o request\n");
// [ process i/o request ]
...
// [ destroy request ]
destroy_request(io_req);
}
// [ clean up some stuff ]
return 0;
}
Then, in the code I will have somewhere:
MyIoRequest * io_req = allocate_request(...params...);
ReadFile(..., (OVERLAPPED*)io_req);
and this just works perfectly.
Now my question is: what if I want to close the IOCP queue immediately without causing leaks (e.g. the application must exit)?
I mean: if I set is_iocp_active to false, the next time GetQueuedCompletionStatus() dequeues an I/O request, that will be the last one: the function will return, causing the thread to exit, and when a thread exits all of its pending I/O requests are simply cancelled by the system, according to MSDN.
But the structures of type MyIoRequest that I instantiated when calling ReadFile() won't be destroyed at all: the system has cancelled the pending I/O requests, but I have to destroy those structures manually, or I will leak every pending I/O request when I stop the loop!
So, how could I do this? Am I wrong to stop the IOCP loop by just setting that variable to false? Note that this would happen even if I used APC requests to stop an alertable thread.
The solution that comes to mind is to add every MyIoRequest structure to a queue/list and then dequeue them when GetQueuedCompletionStatusEx returns, but wouldn't that create a bottleneck, since the enqueue/dequeue of such MyIoRequest structures must be interlocked? Maybe I've misunderstood how to use the IOCP loop. Can someone shed some light on this topic?
The way I normally shut down an IOCP thread is to post my own 'shut down now please' completion. That way you can cleanly shut down and process all of the pending completions and then shut the threads down.
The way to do this is to call PostQueuedCompletionStatus() with 0 for the number of bytes, the completion key and pOverlapped. This means that the completion key is a unique value (you won't have a valid file or socket with a zero handle/completion key).
Step one is to close the sources of completions, so close or abort your socket connections, close your files, etc. Once all of those are closed you can't be generating any more completion packets, so you then post your special '0' completion; post one for each thread you have servicing your IOCP. Once a thread gets a '0' completion key, it exits.
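A minimal sketch of that pattern, adapted to the WorkerProc loop from the question; iocp and num_threads are assumed names for the port handle and worker count.

// After all files/sockets associated with the port are closed, post one
// zeroed completion per worker thread.
for (int i = 0; i < num_threads; ++i)
    PostQueuedCompletionStatus(iocp, 0, 0, NULL);

// Inside the worker, right after GetQueuedCompletionStatus() returns:
if (dwTrans == 0 && dwKey == 0 && io_req == NULL)
    break;   // shutdown packet: no more completions can arrive, leave the loop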
If you are terminating the app, and there's no overriding reason not to do so (e.g. DB connections to close, interprocess shared-memory issues), call ExitProcess(0).
Failing that, call CancelIo() for all socket handles and process all the cancelled completions as they come in.
Try ExitProcess() first!