3 queues + 1 finish or device-side checkpoints for all queues - events

Is there a special "wait for event" function that can wait for 3 queues at the same time at device side so it doesn't wait for all queues serially from host side?
Is there a checkpoint command to send into a command queue such that it must wait for other command queues to hit same(vertically) barrier/checkpoint to wait and continue from device side so no host-side round-trip is needed?
For now, I tried two different versions:
clWaitForEvents(3, evt_);
and
int evtStatus0 = 0;
clGetEventInfo(evt_[0], CL_EVENT_COMMAND_EXECUTION_STATUS,
sizeof(cl_int), &evtStatus0, NULL);
while (evtStatus0 > 0)
{
clGetEventInfo(evt_[0], CL_EVENT_COMMAND_EXECUTION_STATUS,
sizeof(cl_int), &evtStatus0, NULL);
Sleep(0);
}
int evtStatus1 = 0;
clGetEventInfo(evt_[1], CL_EVENT_COMMAND_EXECUTION_STATUS,
sizeof(cl_int), &evtStatus1, NULL);
while (evtStatus1 > 0)
{
clGetEventInfo(evt_[1], CL_EVENT_COMMAND_EXECUTION_STATUS,
sizeof(cl_int), &evtStatus1, NULL);
Sleep(0);
}
int evtStatus2 = 0;
clGetEventInfo(evt_[2], CL_EVENT_COMMAND_EXECUTION_STATUS,
sizeof(cl_int), &evtStatus2, NULL);
while (evtStatus2 > 0)
{
clGetEventInfo(evt_[2], CL_EVENT_COMMAND_EXECUTION_STATUS,
sizeof(cl_int), &evtStatus2, NULL);
Sleep(0);
}
second one is a bit faster(I saw it from someone else) and both are executed after 3 flush commands.
Looking at CodeXL profiler results, first one waits longer between finish points and some operations don't even seem to be overlapping. Second one shows 3 finish points are all within 3 milliseconds so it is faster and longer parts are overlapped(read+write+compute at the same time).
If there is a way to achieve this with only 1 wait command from host side, there must a "flush" version of it too but I couldn't find.
Is there any way to achieve below picture instead of adding flushes between each pipeline step?
queue1 write checkpoint write checkpoint write
queue2 - compute checkpoint compute checkpoint compute
queue3 - checkpoint read checkpoint read
all checkpoints have to be vertically synchronized and all these actions must not start until a signal is given. Such as:
queue1.ndwrite(...);
queue1.ndcheckpoint(...);
queue1.ndwrite(...);
queue1.ndcheckpoint(...);
queue1.ndwrite(...);
queue2.ndrangekernel(...);
queue2.ndcheckpoint(...);
queue2.ndrangekernel(...);
queue2.ndcheckpoint(...);
queue2.ndrangekernel(...);
queue3.ndread(...);
queue3.ndcheckpoint(...);
queue3.ndread(...);
queue3.ndcheckpoint(...);
queue3.ndread(...);
queue1.flush()
queue2.flush()
queue3.flush()
queue1.finish()
queue2.finish()
queue3.finish()
checkpoints are all handled in device side and only 3 finish commands are needed from host side(even better,only 1 finish for all queues?)
How I bind 3 queues to 3 events with "clWaitForEvents(3, evt_);" for now is:
hCommandQueue->commandQueue.enqueueBarrierWithWaitList(NULL, &evt[0]);
hCommandQueue2->commandQueue.enqueueBarrierWithWaitList(NULL, &evt[1]);
hCommandQueue3->commandQueue.enqueueBarrierWithWaitList(NULL, &evt[2]);
if this "enqueue barrier" can talk with other queues, how could I achieve that? Do I need to keep host-side events alive until all queues are finished or can I delete them or re-use them later? From the documentation, it seems like first barrier's event can be put to second queue and second one's barrier event can be put to third one along with first one's event so maybe it is like:
hCommandQueue->commandQueue.enqueueBarrierWithWaitList(NULL, &evt[0]);
hCommandQueue2->commandQueue.enqueueBarrierWithWaitList(evt_0, &evt[1]);
hCommandQueue3->commandQueue.enqueueBarrierWithWaitList(evt_0_and_1, &evt[2]);
in the end wait for only evt[2] maybe or using only 1 same event for all:
hCommandQueue->commandQueue.enqueueBarrierWithWaitList(sameEvt, &evt[0]);
hCommandQueue2->commandQueue.enqueueBarrierWithWaitList(sameEvt, &evt[1]);
hCommandQueue3->commandQueue.enqueueBarrierWithWaitList(sameEvt, &evt[2]);
where to get sameEvt object?
anyone tried this? Should I start all queues with a barrier so they dont start until I raise some event from host side or lazy-executions of "enqueue" is %100 trustable to "not to start until I flush/finish" them? How do I raise an event from host to device(sameEvt doesn't have a "raise" function, is it clCreateUserEvent?)?
All 3 queues are in-order type and are in same context. Out-of-order type is not supported by all graphics cards. C++ bindings are being used.
Also there are enqueueWaitList(is this deprecated?) and clEnqueueMarker but I don't know how to use them and documentation doesn't have any example in Khronos' website.

You asked too many questions and expressed too many variants to provide you with the only solution, so I will try to answer in general that you can figure out the most suitable solution.
If the queues are bind to the same context (possibly to different devices within the same context) than it is possible to synchronize them through the events. I.e. you can obtain an event from a command submitted to one queue and use this event to synchronize a command submitted to another queue, e.g.
queue1.enqueue(comm1, /*dependency*/ NULL, /*result event*/ &e1);
queue2.enqueue(comm2, /*dependency*/ &e1, /*result event*/ NULL);
In this example, comm2 will wait for comm1 completion.
If you need to enqueue commands first but no to allow them to be executed you can create user event (clCreateUserEvent) and signal it manually (clSetUserEventStatus). The implementation is allowed to process command as soon as they enqueued (the driver is not required to wait for the flush).
The barrier seems overkill for your purpose because it waits for all commands previously submitted to the queue. You can really use clEnqueueMarker that can be used to wait for all events and provide one event to be used for other commands.
As far as I know you can retain the event at any moment if you do not need it more. The implementation should prolong the event life-time if it is required for internal purposes.
I do not know what is enqueueWaitList.
Off-topic: if you need non-trivial dependencies between calculations you may want to consider TBB flow graph and opencl_node. The opencl_node uses events for syncronization and avoids "host-device" synchronizations if possible. However, it can be tricky to use multiple queues for the same device.
As far as I know, Intel HD Graphics 530 supports out-of-order queues (at least host-side).

You are making it much harder than it needs to be. On the write queue take an event. Use that as a condition for the compute on the compute queue, and take another event. Use that as a condition on the read on the read queue. There is no reason to force any other synchronization. Note: My interpretation of the spec is that you must clFlush on a queue that you took an event from before using that event as a condition on another queue.

Related

When should I use a wait function like MsgWaitForMultipleObjects, and when shouldn't I?

So I am trying to understand the message processing code of Unreal Engine on Windows OS, and I didn't find any frequent usage of the function MsgWaitForMultipleObjects or MsgWaitForMultipleObjectsEx in the message pumping code.
The engine message pumping goes like this:
MSG Message;
// standard Windows message handling
while(PeekMessage(&Message, NULL, 0, 0, PM_REMOVE))
{
TranslateMessage(&Message);
DispatchMessage(&Message);
}
For context, this code will run every frame one to three times, meaning the code will be executed each 2 - 5 milliseconds on average throughout the running time of the application. A) Does that make wait functions unnecessary? or am I missing something here!
B) Is there any rough estimation of how long an application could be busy doing 'other stuff' before processing incoming messages? For instance if an application only processes messages every 50 millisecond, is that a bad practice? or is that a reasonable way of doing it? And what if the period became 500 milliseconds and so?
Use MsgWaitForMultipleObjects/etc if you need to both handle window message processing and kernel handle or alertable waits in a single thread. If you are only doing message processing then simply use a normal GetMessage based message loop, if only doing kernel handle or alertable waits then use WaitForMultipleObjects as appropriate.

zeromq: ZMQ_CONFLATE==1 does not stop queues from saving old messages

With ZeroMQ and CPPZMQ 4.3.2, I want to drop old messages for all my sockets including
PAIR
Pub/Sub
REQ/REP
So I use m_socks[channel].setsockopt(ZMQ_CONFLATE, 1) on all my sockets before binding/connecting.
Test
However, when I made the following test, it seems that the old messages are still flushed out on each reconnection. In this test,
I use a thread to keep sending generated sinewave to a receiver thread
Every 10 seconds I double the sinewave's frequency
Then after 10 seconds I stop the process
Below is the pseudocode of the sender
// on sender end
auto thenSec = high_resolution_clock::now();
while(m_isRunning) {
// generate sinewave, double the frequency every 10s or so
auto nowSec = high_resolution_clock::now();
if (duration_cast<seconds>(nowSec - thenSec).count() > 10) {
m_sine.SetFreq(m_sine.GetFreq()*2);
thenSec = nowSec;
}
m_sine.Generate(audio);
// send to rendering thread
m_messenger.send("inproc://sound-ear.pair",
(const void*)(audio),
audio_size,
zmq::send_flags::dontwait
);
}
Note that I already use DONTWAIT to mitigate blocking.
On the receiver side I have a zmq::poller_event handler that simply receives the last message on event polling.
In the stop sequence I reset the sinewave frequency to its lowest value, say, 440Hz.
Expected
The expected behaviour would be:
If I stop both the sender and the receiver after 10s when the frequency is doubled,
and I restart both,
then I should see the sinewave reset to 440Hz.
Observed
But the observed behaviour is that the received sinewave is still of the doubled frequency after restarting the communication, i.e., 880Hz.
Question
Am I doing it wrong or should I use some kind of killswitch to force drop all messages in this case?
OK, I think I solved it myself. Kind of.
Actual solution
I finally realized that the behaviour I want is to flush all messages when I stop the rendering. According to the official doc(How can I flush all messages that are in the ZeroMQ socket queue?), this can only be achieved by
set the sockets of both sender's and receiver's ZMQ_LINGER option to 0, meaning to keep nothing on closing those sockets;
closing the sockets on both sender and receiver ends, which also involves bootstrapping pollers and all references to the sockets.
This seems a lot of unnecessary work if I'm to restart rendering my data again, right after the stop sequence. But I found no other way to solve this cleanly.
Initial effort
It seems to me that ZMQ_CONFLATE does not make a difference on PAIR. I really have to tweak high water marks on sender and receiver ends using ZMQ_SNDHWM and ZMQ_RCVHWM.
However, I said "kind of solved" because tweaking HWM in the end is not the optimal solution for a realtime application,
having ZMQ_SNDHWM / ZMQ_RCVHWM set to the minimum "1", we still have a sizable latency in terms of realtime.
Also, the consumer thread could fall into underrun situatioin, i.e., perceivable jitters with the lowest HWM.
If I'm not doing anything wrong, I guess the optimal solution would still be shared memory for my targeted scenario. This is sad because I really enjoyed the simplicity of ZMQ's multicast messaging patternsand hate to deal with thread locking littered everywhere.

How can i terminate myself if i run too long?

I have a application that runs periodically (it's a scheduled task). The task is launched once a minute, and normally only takes a few seconds to do its business, then exits.
But there's a ~1 in 80,000 chance (every two or three months) that the application will hang. The root cause is because we're using Microsoft ServerXmlHttpRequest component to perform some work, and sometimes it just decides to hang. The virtue of ServerXmlHttpRequest over XmlHttpRequest is that the latter is not recommended for important scenarios, such as where reliability and security are important (which is true of an unattended server component):
The ServerXMLHTTP object offers functionality similar to that of the XMLHTTP object. Unlike XMLHTTP, however, the ServerXMLHTTP object does not rely on the WinInet control for HTTP access to remote XML documents. ServerXMLHTTP uses a new HTTP client stack. Designed for server applications, this server-safe subset of WinInet offers the following advantages:
Reliability — The HTTP client stack offers longer uptimes. WinInet features that are not critical for server applications, such as URL caching, auto-discovery of proxy servers, HTTP/1.1 chunking, offline support, and support for Gopher and FTP protocols are not included in the new HTTP subset.
Security — The HTTP client stack does not allow a user-specific state to be shared with another user's session. ServerXMLHTTP provides support for client certificates.
The job is being run as a scheduled task. I need the task to continue to run periodically; killing the existing process if it's dead.
The Windows Task Scheduler does have an option for forcibly close a task that is running too long:
The only downside to that approach is that it simply doesn't work - it simply does not stop the task. The hung process keeps running.
Given that i cannot trust the Microsoft ServerXmlHttpRequest to not arbitrarily lock up, and the task scheduler is unable to terminate the scheduled task, i need some way to do it myself.
Jobs
I tried looking into using the Job Objects API:
A job object allows groups of processes to be managed as a unit. Job objects are namable, securable, sharable objects that control attributes of the processes associated with them. A job can enforce limits such as working set size, process priority, and end-of-job time limit on each process that is associated with the job.
That one note sounded like exactly what i needed:
A job can enforce limits such as end-of-job time limit on each process that is associated with the job.
The only down-side to that approach is that it does not work. Job cannot impose a time-limit on a process. They can only impose a user time limit on a process:
PerProcessUserTimeLimit
If LimitFlags specifies JOB_OBJECT_LIMIT_PROCESS_TIME, this member is the per-process user-mode execution time limit, in 100-nanosecond ticks.
If the process is idle (for example, sitting at a MsgWaitForSingleObject as ServerXmlHttpRequest is), then it will accumulate no user time. I tested it. I created a job with a 1 second time limit, and placed my self process into it. As long as i don't move the mouse around my test application, it quite happily sits there for longer than one second.
Watchdog Thread
The only other technique i can imagine, given that my main thread is indefinitely blocked, is another thread. The only solution i can imagine is spawn another thread that will sleep for my three minutes, then ExitProcess:
Int32 watchdogTimeoutSeconds = FindCmdLineSwitch("watchdog", 0);
if (watchdogTimeoutSeconds > 0)
Thread thread = new Thread(KillMeCallback, new IntPtr(watchdogTimeoutSeconds));
void KillMeCallback(IntPtr data)
{
Int32 secondsUntilProcessIsExited = data.ToInt32();
if (secondsUntilProcessIsExited <= 0)
return;
Sleep(secondsUntilProcessIsExited*1000); //seconds --> milliseconds
LogToEventLog(ExtractFilename(Application.ExeName),
"Watchdog fired after "+secondsUntilProcessIsExited.ToString()+" seconds. Process will be forcibly exited.", EVENTLOG_WARNING_TYPE, 999);
ExitProcess(999);
}
And that works. The only downside is that it's a bad idea.
Can anyone think of anything better?
Edit
For now i will implement a
Contoso.exe /watchdog 180
So the process will be exited after 180 seconds. It means the duration is configurable, or can be removed completely easily in the field.
I used the route where i pass a special WatchDog argument to my process on the command line;
>Contoso.exe /watchdog 180
During initialization i check for the presence of the WatchDog option, with an integer number of seconds after it:
String s = Toolkit.FindCmdLineOption("watchdog", ["/", "-"]);
if (s <> "")
{
Int32 seconds = StrToIntDef(s, 0);
if (seconds > 0)
RunInThread(WatchdogThreadProc, Pointer(seconds));
}
and my thread procedure:
void WatchdogProc(Pointer Data);
{
Int32 secondsUntilProcessIsExited = Int32(Data);
if (secondsUntilProcessIsExited <= 0)
return;
Sleep(secondsUntilProcessIsExited*1000); //seconds -> milliseconds
LogToEventLog(ExtractFileName(ParamStr(0)),
Format("Watchdog fired after %d seconds. Process will be forcibly exited.", secondsUntilProcessIsExited),
EVENTLOG_WARNING_TYPE, 999);
ExitProcess(2);
}

Checking Win32 file streams for available input

I have a simple tunnel program that needs to simultaneously block on standard input and a socket. I currently have a program that looks like this (error handling and boiler plate stuff omitted):
HANDLE host = GetStdHandle(STD_INPUT_HANDLE);
SOCKET peer = ...; // socket(), connect()...
WSAEVENT gate = WSACreateEvent();
OVERLAPPED xfer;
ZeroMemory(&xfer, sizeof(xfer));
xfer.hEvent = gate;
WSABUF pbuf = ...; // allocate memory, set size.
// start an asynchronous transfer.
WSARecv(peer, &pbuf, 1, 0, &xfer, 0);
while ( running )
{
// wait until standard input has available data or the event
// is signaled to inform that socket read operation completed.
HANDLE handles[2] = { host, gate };
const DWORD which = WaitForMultipleObjects
(2, handles, FALSE, INFINITE) - WAIT_OBJECT_0;
if (which == 0)
{
// read stuff from standard input.
ReadFile(host, ...);
// process stuff received from host.
// ...
}
if (which == 1)
{
// process stuff received from peer.
// ...
// start another asynchronous transfer.
WSARecv(peer, &pbuf, 1, 0, &xfer, 0);
}
}
The program works like a charm, I can transfer stuff through this tunnel program without a hitch. The thing is that it has a subtle bug.
If I start this program in interactive mode from cmd.exe and standard input is attached to the keyboard, pressing a key that does not produce input (e.g. the Ctrl key) makes this program block and ignore data received on the socket. I managed to realize that this is because pressing any key signals the standard input handle and WaitForMultipleObjects() returns. As expected, control enters the if (which == 0) block and the call to ReadFile() blocks because there is no input available.
Is there a means to detect how much input is available on a Win32 stream? If so, I could use this to check if any input is available before calling ReadFile() to avoid blocking.
I know of a few solutions for specific types of streams (notably ClearCommError() for serial ports and ioctlsocket(socket,FIONBIO,&count) for sockets), but none that I know of works with the CONIN$ stream.
Use overlapped I/O. Then test the event attached to the I/O operation, instead of the handle.
For CONIN$ specifically, you might also look at the Console Input APIs, such as PeekConsoleInput and GetNumberOfConsoleInputEvents
But I really recommend using OVERLAPPED (background) reads wherever possible and not trying to treat WaitForMultipleObjects like select.
Since the console can't be overlapped in overlapped mode, your simplest options are to wait on the console handle and use ReadConsoleInput (then you have to process control sequences manually), or spawn a dedicated worker thread for synchronous ReadFile. If you choose a worker thread, you may want to then connect a pipe between that worker and the main I/O loop, using overlapped pipe reads.
Another possibility, which I've never tried, would be to wait on the console handle and use PeekConsoleInput to find out whether to call ReadFile or ReadConsoleInput. That way you should be able to get non-blocking along with the cooked terminal processing. OTOH, passing control sequences to ReadConsoleInput might inhibit the buffer-manipulation actions they were supposed to take.
If the two streams are processed independently, or nearly so, it may make more sense to start a thread for each one. Then you can use a blocking read from standard input.

Problem with Boost Asio asynchronous connection using C++ in Windows

Using MS Visual Studio 2008 C++ for Windows 32 (XP brand), I try to construct a POP3 client managed from a modeless dialog box.
Te first step is create a persistent object -say pop3- with all that Boost.asio stuff to do asynchronous connections, in the WM_INITDIALOG message of the dialog-box-procedure. Some like:
case WM_INITDIALOG:
return (iniPop3Dlg (hDlg, lParam));
Here we assume that iniPop3Dlg() create the pop3 heap object -say pointed out by pop3p-. Then connect with the remote server, and a session is initiated with the client’s id and password (USER and PASS commands). Here we assume that the server is in TRANSACTION state.
Then, in response to some user input, the dialog-box-procedure, call the appropriate function. Say:
case IDS_TOTAL: // get how many emails in the server
total (pop3p);
return FALSE;
case IDS_DETAIL: // get date, sender and subject for each email in the server
detail (pop3p);
return FALSE;
Note that total() uses the POP3’s STAT command to get how many emails in the server, while detail() uses two commands consecutively; first STAT to get the total and then a loop with the GET command to retrieve the content of each message.
As an aside: detail() and total() share the same subroutines -the STAT handle routine-, and when finished, both leaves the session as-is. That is, without closing the connection; the socket remains opened an the server in TRANSACTION state.
When any option is selected by the first time, the things run as expected, obtaining the desired results. But when making the second chance, the connection hangs.
A closer inspection show that the first time that the statement
socket_.get_io_service().run();
Is used, never ends.
Note that all asynchronous write and read routines uses the same io_service, and each routine uses socket_.get_io_service().reset() prior to any run()
Not also that all R/W operations also uses the same timer, who is reseted to zero wait after each operation is completed:
dTimer_.expires_from_now (boost::posix_time::seconds(0));
I suspect that the problem is in the io_service or in the timer, and the fact that subsequent executions occurs in a different load of the routine.
As a first approach to my problem, I hope that someone would bring some light in it, prior to a more detailed exposition of the -very few and simple- routines involved.
Have you looked at the asio examples and studied them? There are several asynchronous examples that should help you understand the basic control flow. Pay particular importance to the main event loop started by invoking io_service::run, it's important to understand control is not expected to return to the caller until the io_service has no more remaining work to do.

Resources