So my problem stems from I thought I'd be awfully clever and try and model hardware in Golang. Yes, use it for the sort of thing that in my day job I write in Verilog.
To start off I'm trying to do the simplest possible hardware pipeline. i.e. a data source that pushes data to a Reg stage which pushes it to the next Reg which pushes it to a sink stage. Things can only advance on the clock. The problem is the event manager seems to have it in for me.
Starting at the data being sent on a clock
select {
case out_chan <- tmp:
front_chan = saved_front_chan
out_chan = nil
fmt.Println("Sent Data, ", tmp, name)
default:
c.ckwg.Done() // Send failed so remove token
if out_chan != nil {
log.Fatal("Stalled, Can't send ", tmp, name)
} else {
fmt.Println("Nothing to send ", name)
}
}
i.e. if we can send the data, do, if we can't then fine; but if we can't send data then the front end receiver (see main code) doesn't get re-enabled for reception.
This is sending to a front end which is a bit simpler
for itm := range in_chan {
c.ckwg.Done()
fmt.Println("Front Channel received", itm, name)
saved_front_chan <- itm
}
Note the use of a sync.WaitGroup to make sure that this code is run before the end of the clock evaluation phase.
So far so good. The problem is sometimes one of these front ends will have sucessfully "saved_front_chan <- itm" but will not have got around back to listening on in_chan.
This will mean that the previous stage in the pipeline will think it is stalled.
Now this would be fine in a data processing pipeline, but for something that's trying to model hardware to some degree of accuracy it is not. (For this test case I'm modelling a pipeline that tries to support stalls but has no need to stall, so doesn't) As far as I can see though there are no tricks left to me to force the scheduler to have the loop ready to read. Fundamentally I need to know the difference between "Channel can't receive because of pipeline back pressure" and "Channel can't receive because the scheduler hasn't got around to that routine yet."
So the full code is here:
https://github.com/cbehopkins/cbhdl/blob/master/pipe_test.go
For context, the specification I am actually trying to meet is a stage that only advances the data on a clock pulse, and that clock pulse is supplied to all stages "Simultaneously". Yes I know nothing is really in parallel but I'm trying to fake that behavior. Specifically it must appear that the inputs are sampled at the start of the clock phase and then the data is output at the end of the clock phase. Crucially in the faking of the parallelism it must not matter which order the reg stages receive their clock pulses. I have also experimented with the flow control not being done by golang channel back pressure but by a separate token system - but that got stupendously complex and error prone.
FWIW I have another trial version of this where the broadcast of the clock pulse is done with sync package broadcast structure - but that code got in the way of demonstrating the problem I'm having.
Thanks in advance.
Related
Besides a few tutorials on Go I have no actual experience in it. I'm trying to take a project written in Go and converting it into a windows service.
I honestly haven't tried anything besides trying to find things to read over. I have found a few threads and choosen the best library I felt covered all of our needs
https://github.com/golang/sys
// Copyright 2012 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
// +build windows
package main
import (
"fmt"
"strings"
"time"
"golang.org/x/sys/windows/svc"
"golang.org/x/sys/windows/svc/debug"
"golang.org/x/sys/windows/svc/eventlog"
)
var elog debug.Log
type myservice struct{}
func (m *myservice) Execute(args []string, r <-chan svc.ChangeRequest, changes chan<- svc.Status) (ssec bool, errno uint32) {
const cmdsAccepted = svc.AcceptStop | svc.AcceptShutdown | svc.AcceptPauseAndContinue
changes <- svc.Status{State: svc.StartPending}
fasttick := time.Tick(500 * time.Millisecond)
slowtick := time.Tick(2 * time.Second)
tick := fasttick
changes <- svc.Status{State: svc.Running, Accepts: cmdsAccepted}
loop:
for {
select {
case <-tick:
beep()
elog.Info(1, "beep")
case c := <-r:
switch c.Cmd {
case svc.Interrogate:
changes <- c.CurrentStatus
// Testing deadlock from https://code.google.com/p/winsvc/issues/detail?id=4
time.Sleep(100 * time.Millisecond)
changes <- c.CurrentStatus
case svc.Stop, svc.Shutdown:
// golang.org/x/sys/windows/svc.TestExample is verifying this output.
testOutput := strings.Join(args, "-")
testOutput += fmt.Sprintf("-%d", c.Context)
elog.Info(1, testOutput)
break loop
case svc.Pause:
changes <- svc.Status{State: svc.Paused, Accepts: cmdsAccepted}
tick = slowtick
case svc.Continue:
changes <- svc.Status{State: svc.Running, Accepts: cmdsAccepted}
tick = fasttick
default:
elog.Error(1, fmt.Sprintf("unexpected control request #%d", c))
}
}
}
changes <- svc.Status{State: svc.StopPending}
return
}
func runService(name string, isDebug bool) {
var err error
if isDebug {
elog = debug.New(name)
} else {
elog, err = eventlog.Open(name)
if err != nil {
return
}
}
defer elog.Close()
elog.Info(1, fmt.Sprintf("starting %s service", name))
run := svc.Run
if isDebug {
run = debug.Run
}
err = run(name, &myservice{})
if err != nil {
elog.Error(1, fmt.Sprintf("%s service failed: %v", name, err))
return
}
elog.Info(1, fmt.Sprintf("%s service stopped", name))
}
So I spent some time going over this code. Tested it out to see what it does. It performs as it should.
The question I have is we currently have a Go program that takes in arguments and for our service we pass in server. Which spins up our stuff on a localhost webpage.
I believe the code above may have something to do with that but I'm lost at how I would actually get it spin off our exe with the correct arguements. Is this the right spot to call main?
Im sorry if this is vague. I dont know exactly how to make this interact with our already exisiting exe.
I can get that modified if I know what needs to be changed. I appreacite any help.
OK, that's much clearer now. Well, ideally you should start with some tutorial on what constitutes a Windows service—I bet tihis might have solved the problem for you. But let's try anyway.
Some theory
A Windows service sort of has two facets: it performs some useful task and it communicates with the SCM facility. When you manipulate a service using the sc command or through the Control Panel, you have that piece of software to talk with SCM on your behalf, and SCM talks with that service.
The exact protocol the SCM and a service use is low-level and complicated
and the point of the Go package you're using is to hide that complexity from you
and offer a reasonably Go-centric interface to that stuff.
As you might gather from your own example, the Execute method of the type you've created is—for the most part—concerned with communicating with SCM: it runs an endless for loop which on each iteration sleeps on reading from the r channel, and that channel delivers SCM commands to your service.
So you basically have what could be called "an SCM command processing loop".
Now recall those two facets above. You already have one of them: your service interacts with SCM, so you need another one—the code which actually performs useful tasks.
In fact, it's already partially there: the example code you've grabbed creates a time ticker which provides a channel on which it delivers a value when another tick passes. The for loop in the Execute method reads from that channel as well, "doing work" each time another tick is signalled.
OK, this is fine for a toy example but lame for a real work.
Approaching the solution
So let's pause for a moment and think about our requirements.
We need some code running and doing our actual task(s).
We need the existing command processing loop to continue working.
We need these two pieces of code to work concurrently.
In this toy example the 3rd point is there "for free" because a time ticker carries out the task of waiting for the next tick automatically and fully concurrently with the rest of the code.
Your real code most probably won't have that luxury, so what do you do?
In Go, when you need to do something concurrently with something else,
an obvious answer is "use a goroutine".
So the first step is to grab your existing code, turn it into a callable function
and then call it in a separate goroutine right before entering the for loop.
This way, you'll have both pieces run concurrently.
The hard parts
OK, that wasn't hard.
The hard parts are:
How to configure the code which performs the tasks.
How to make the SCM command processing loop and the code carrying out tasks communicate.
Configuration
This one really depends on the policies at your $dayjob or of your $current_project, but there are few hints:
A Windows service may receive command-line arguments—either for a single run or permanently (passed to the service on each of its runs).
The downside is that it's not convenient to work with them from the UI/UX standpoint.
Typically Windows services used to read the registry.
These days (after the advent of .NET and its pervasive xml-ity) the services tend to read configuration files.
The OS environment most of the time is a bad fit for the task.
You may combine several of these venues.
I think I'd start with a configuration file but then again, you should pick the path of the least resistance, I think.
One of the things to keep in mind is that the reading and processing of the configuration should better be done before the service signals the SCM it started OK: if the configuration is invalid or cannot be loaded, the service should extensively log that and signal it failed, and not run the actual task processing code.
Communication between the command processing loop and the tasks carrying code
This is IMO the hardest part.
It's possible to write a whole book here but let's keep it simple for now.
To make it as simple as possible I'd do the following:
Consider pausing, stopping and shutting down mostly the same: all these signals must tell your task processing code to quit, and then wait for it to actually do that.
Consider the "continue" signal the same as starting the task processing function: run it again—in a new goroutine.
Have a one-directional communication: from the control loop to the tasks processing code, but not the other way—this will greatly simplify service state management.
This way, you may create a single channel which the task processing code listens on—or checks periodically, and when a value comes from that channel, the code stops running, closes the channel and exits.
The control loop, when the SCM tells it to pause or stop or shut down, sends anything on that channel then waits for it to close. When that happens, it knows the tasks processing code is finished.
In Go, a paradigm for a channel which is only used for signaling, is to have a channel of type struct{} (an empty struct).
The question of how to monitor this control channel in the tasks running code is an open one and heavily depends on the nature of the tasks it performs.
Any further help here would be reciting what's written in the Go books on concurrency so you should have that covered first.
There's also an interesting question of how to have the communication between the control loop and the tasks processing loop resilient to the possible processing stalls in the latter, but then again, IMO it's too early to touch upon that.
Let me give a brief context first:
I have a scenario where the RSUs will broadcast a fixed message 'RSUmessage' about every TRSU seconds. I have implemented the following code for RSU broadcast (also, these fixed messages have Psid = -100 to differentiate them from others):
void TraCIDemoRSU11p::handleSelfMsg(cMessage* msg) {
if (WaveShortMessage* wsm = dynamic_cast<WaveShortMessage*>(msg)) {
if(wsm->getPsid()==-100){
sendDown(RSUmessage->dup());
scheduleAt(simTime() + trsu +uniform(0.02, 0.05), RSUmessage);
}
}
else {
BaseWaveApplLayer::handleSelfMsg(wsm);
}
}
A car can receive these messages from other cars as well as RSUs. RSUs discard the messages received from cars. The cars will receive multiple such messages, do some comparison stuff and periodically broadcast a similar type of message : 'aggregatedMessage' per interval Tcar. aggregatedMessage also have Psid=-100 ,so that the message can be differentiated from other messages easily.
I am scheduling the car events using self messages. (Though it could have been done inside handlePositionUpdate I believe). The handleSelfMsg of a car is following:
void TraCIDemo11p::handleSelfMsg(cMessage* msg) {
if (WaveShortMessage* wsm = dynamic_cast<WaveShortMessage*>(msg)) {
wsm->setSerial(wsm->getSerial() +1);
if (wsm->getPsid() == -100) {
sendDown(aggregatedMessage->dup());
//sendDelayedDown(aggregatedMessage->dup(), simTime()+uniform(0.1,0.5));
scheduleAt(simTime()+tcar+uniform(0.01, 0.05), aggregatedMessage);
}
//send this message on the service channel until the counter is 3 or higher.
//this code only runs when channel switching is enabled
else if (wsm->getSerial() >= 3) {
//stop service advertisements
stopService();
delete(wsm);
}
else {
scheduleAt(simTime()+1, wsm);
}
}
else {
BaseWaveApplLayer::handleSelfMsg(msg);
}
}
PROBLEM: With this setting, the simulation is very very slow. I get about 50 simulation seconds in 5-6 hours or more in Express mode in OMNET GUI. (No. of RSU: 64, Number of Vehicle: 40, around 1kmx1km map)
Also, I am referring to this post. The OP says that he got faster speed by removing the sending of message after each RSU received a message. In my case I cannot remove that, because I need to send out the broadcast messages after each interval.
Question: I think that this slowness is because every node is trying to sendDown messages at the beginning of each simulated second. Is it the case that when all vehicles and nodes schedules and sends message at the same time OMNET slows down? (Makes sense to slow down , but by what degree) But, there are only around 100 nodes overall in the simulation. Surely it cannot be this slow.
What I tried : I tried using sendDelayedDown(wsm->dup(), simTime()+uniform(0.1,0.5)); to spread the sending of the messages through out 1st half of each simulated second. This seems to stop messages piling up at the beginning of each simulation seconds and sped things up a bit, but not so much overall.
Can anybody please let me know if this is normal behavior or whether I am doing something wrong.
Also I apologize for the long post. I could not explain my problem without giving the context.
It seems you are flooding your network with messages: Every message from an RSU gets duplicated and transmitted again by every Car which has received this message. Hence, the computational time increases quadratically with the number of nodes (sender of messages) in your network (every sent message has to be handled by every node which is in range to receive it). The limit of 3 transmissions per message does not seem to help much and, as the comment in the code indicates, is not used at all, if there is no channel switching.
Therefore, if you can not improve/change your code to simply send less messages, you have to live with that. Your little tweak to send the messages in a delayed manner only distributes the messages over one second but does not solve the problem of flooding.
However, there are still some hints you can follow to improve the performance of your simulation:
Compile in release mode: make MODE=Release
Run your simulation in the terminal environment Cmdenv: ./run -u Cmdenv ...
If you need to use the graphical environment by all means, you should at least speed up the animations by using the slider in the upper part of the interface.
Removing the simtime-resolution parameter from the omnetpp.ini file solves the problem.
It seems the simulation kernel has an issue when the channel delay does not match the simulation-time resolution.
You can verify the solution by cloning the following repository. Note that you need a functional installation of the OMNeT++ framework. Specifically, I test this fix in OMNeT++ 5.6.2.
https://github.com/Ryuuba/flooding
Is there a special "wait for event" function that can wait for 3 queues at the same time at device side so it doesn't wait for all queues serially from host side?
Is there a checkpoint command to send into a command queue such that it must wait for other command queues to hit same(vertically) barrier/checkpoint to wait and continue from device side so no host-side round-trip is needed?
For now, I tried two different versions:
clWaitForEvents(3, evt_);
and
int evtStatus0 = 0;
clGetEventInfo(evt_[0], CL_EVENT_COMMAND_EXECUTION_STATUS,
sizeof(cl_int), &evtStatus0, NULL);
while (evtStatus0 > 0)
{
clGetEventInfo(evt_[0], CL_EVENT_COMMAND_EXECUTION_STATUS,
sizeof(cl_int), &evtStatus0, NULL);
Sleep(0);
}
int evtStatus1 = 0;
clGetEventInfo(evt_[1], CL_EVENT_COMMAND_EXECUTION_STATUS,
sizeof(cl_int), &evtStatus1, NULL);
while (evtStatus1 > 0)
{
clGetEventInfo(evt_[1], CL_EVENT_COMMAND_EXECUTION_STATUS,
sizeof(cl_int), &evtStatus1, NULL);
Sleep(0);
}
int evtStatus2 = 0;
clGetEventInfo(evt_[2], CL_EVENT_COMMAND_EXECUTION_STATUS,
sizeof(cl_int), &evtStatus2, NULL);
while (evtStatus2 > 0)
{
clGetEventInfo(evt_[2], CL_EVENT_COMMAND_EXECUTION_STATUS,
sizeof(cl_int), &evtStatus2, NULL);
Sleep(0);
}
second one is a bit faster(I saw it from someone else) and both are executed after 3 flush commands.
Looking at CodeXL profiler results, first one waits longer between finish points and some operations don't even seem to be overlapping. Second one shows 3 finish points are all within 3 milliseconds so it is faster and longer parts are overlapped(read+write+compute at the same time).
If there is a way to achieve this with only 1 wait command from host side, there must a "flush" version of it too but I couldn't find.
Is there any way to achieve below picture instead of adding flushes between each pipeline step?
queue1 write checkpoint write checkpoint write
queue2 - compute checkpoint compute checkpoint compute
queue3 - checkpoint read checkpoint read
all checkpoints have to be vertically synchronized and all these actions must not start until a signal is given. Such as:
queue1.ndwrite(...);
queue1.ndcheckpoint(...);
queue1.ndwrite(...);
queue1.ndcheckpoint(...);
queue1.ndwrite(...);
queue2.ndrangekernel(...);
queue2.ndcheckpoint(...);
queue2.ndrangekernel(...);
queue2.ndcheckpoint(...);
queue2.ndrangekernel(...);
queue3.ndread(...);
queue3.ndcheckpoint(...);
queue3.ndread(...);
queue3.ndcheckpoint(...);
queue3.ndread(...);
queue1.flush()
queue2.flush()
queue3.flush()
queue1.finish()
queue2.finish()
queue3.finish()
checkpoints are all handled in device side and only 3 finish commands are needed from host side(even better,only 1 finish for all queues?)
How I bind 3 queues to 3 events with "clWaitForEvents(3, evt_);" for now is:
hCommandQueue->commandQueue.enqueueBarrierWithWaitList(NULL, &evt[0]);
hCommandQueue2->commandQueue.enqueueBarrierWithWaitList(NULL, &evt[1]);
hCommandQueue3->commandQueue.enqueueBarrierWithWaitList(NULL, &evt[2]);
if this "enqueue barrier" can talk with other queues, how could I achieve that? Do I need to keep host-side events alive until all queues are finished or can I delete them or re-use them later? From the documentation, it seems like first barrier's event can be put to second queue and second one's barrier event can be put to third one along with first one's event so maybe it is like:
hCommandQueue->commandQueue.enqueueBarrierWithWaitList(NULL, &evt[0]);
hCommandQueue2->commandQueue.enqueueBarrierWithWaitList(evt_0, &evt[1]);
hCommandQueue3->commandQueue.enqueueBarrierWithWaitList(evt_0_and_1, &evt[2]);
in the end wait for only evt[2] maybe or using only 1 same event for all:
hCommandQueue->commandQueue.enqueueBarrierWithWaitList(sameEvt, &evt[0]);
hCommandQueue2->commandQueue.enqueueBarrierWithWaitList(sameEvt, &evt[1]);
hCommandQueue3->commandQueue.enqueueBarrierWithWaitList(sameEvt, &evt[2]);
where to get sameEvt object?
anyone tried this? Should I start all queues with a barrier so they dont start until I raise some event from host side or lazy-executions of "enqueue" is %100 trustable to "not to start until I flush/finish" them? How do I raise an event from host to device(sameEvt doesn't have a "raise" function, is it clCreateUserEvent?)?
All 3 queues are in-order type and are in same context. Out-of-order type is not supported by all graphics cards. C++ bindings are being used.
Also there are enqueueWaitList(is this deprecated?) and clEnqueueMarker but I don't know how to use them and documentation doesn't have any example in Khronos' website.
You asked too many questions and expressed too many variants to provide you with the only solution, so I will try to answer in general that you can figure out the most suitable solution.
If the queues are bind to the same context (possibly to different devices within the same context) than it is possible to synchronize them through the events. I.e. you can obtain an event from a command submitted to one queue and use this event to synchronize a command submitted to another queue, e.g.
queue1.enqueue(comm1, /*dependency*/ NULL, /*result event*/ &e1);
queue2.enqueue(comm2, /*dependency*/ &e1, /*result event*/ NULL);
In this example, comm2 will wait for comm1 completion.
If you need to enqueue commands first but no to allow them to be executed you can create user event (clCreateUserEvent) and signal it manually (clSetUserEventStatus). The implementation is allowed to process command as soon as they enqueued (the driver is not required to wait for the flush).
The barrier seems overkill for your purpose because it waits for all commands previously submitted to the queue. You can really use clEnqueueMarker that can be used to wait for all events and provide one event to be used for other commands.
As far as I know you can retain the event at any moment if you do not need it more. The implementation should prolong the event life-time if it is required for internal purposes.
I do not know what is enqueueWaitList.
Off-topic: if you need non-trivial dependencies between calculations you may want to consider TBB flow graph and opencl_node. The opencl_node uses events for syncronization and avoids "host-device" synchronizations if possible. However, it can be tricky to use multiple queues for the same device.
As far as I know, Intel HD Graphics 530 supports out-of-order queues (at least host-side).
You are making it much harder than it needs to be. On the write queue take an event. Use that as a condition for the compute on the compute queue, and take another event. Use that as a condition on the read on the read queue. There is no reason to force any other synchronization. Note: My interpretation of the spec is that you must clFlush on a queue that you took an event from before using that event as a condition on another queue.
I am having very strange behavior in my Go code. The overall gist is that when I have
for {
if messagesRecieved == l {
break
}
select {
case result := <-results:
newWords[result.index] = result.word
messagesRecieved += 1
default:
// fmt.Printf("messagesRecieved: %v\n", messagesRecieved)
if i != l {
request := Request{word: words[i], index: i, thesaurus_word: results}
requests <- request
i += 1
}
}
}
the program freezes and fails to advance, but when I uncomment out the fmt.Printf command, then the program works fine. You can see the entire code here. does anyone know what's causing this behavior?
Go in version 1.1.2 (the current release) has still only the original (since initial release) cooperative scheduling of goroutines. The compiler improves the behavior by inserting scheduling points. Inferred from the memory model they are next to channel operations. Additionaly also in some well known, but intentionally undocumented places, such as where I/O occurs. The last explains why uncommenting fmt.Printf changes the behavior of your program. And, BTW, the Go tip version now sports a preemptive scheduler.
Your code keeps one of your goroutines busy going through the default select case. As there are no other scheduling points w/o the print, no other goroutine has a chance to make progress (assuming default GOMAXPROCS=1).
I recommend to rewrite the logic of the program in a way which avoids spinning (busy waiting). One possible approach is to use a channel send in the default case. As a perhaps nice side effect of using a buffered channel for that, one gets a simple limiter from that for free.
Is it good idea to use MPI_Barrier() to synchronize data in-between iteration steps. Please see below pseudo code.
While(numberIterations< MaxIterations)
{
MPI_Iprobe() -- check for incoming data
while(flagprobe !=0)
{
MPI_Recv() -- receive data
MPI_Iprobe() -- loop if more data
}
updateData() -- update myData
for(i=0;i<N;i++) MPI_Bsend_init(request[i]) -- setup request
for(i=0;i<N;i++) MPI_Start(request[i]) -- send data to all other N processors
if(numberIterations = MaxIterations/2)
MPI_Barrier() -- wait for all processors -- CAN I DO THIS
numberIterations ++
}
Barriers should only be used if the correctness of the program depends on it. From your pseudocode, I can't tell if that's the case, but one barrier halfway through a loop looks very suspect.
Your code will deadlock, with or without a barrier. You receive in every rank before sending any data, so none of the ranks will ever get to a send call. Most applications will have a call such as MPI_Allreduce instead of a barrier after each iteration so all ranks can decide whether an error level is small enough, a task queue is empty, etc. and thus decide whether to terminate.
In this article http://static.msi.umn.edu/rreports/2008/87.pdf it says that you have to call MPI_Free_request() before MPI_Bsend_init().