I'm wondering what kind(s) of data structures / algorithms might help facilitate handling the following situation; I'm not sure if I need a single FIFO, or a priority queue, or multiple FIFOs.
I have N objects that must proceed through a predefined workflow. Each object must complete step 1, then step 2, then step 3, then step 4, etc. Each step is either done quickly or involves a "wait" that depends on something external to finish (like the completion of a file operation or whatever). Each object maintains its own state. If I had to define an interface for these objects, it would be something like this (written below in pseudo-Java, but this question is language-agnostic):
public interface TaskObject
{
    public enum State { READY, WAITING, DONE };
    // READY   = ready to execute next step
    // WAITING = awaiting some external condition
    // DONE    = finished all steps

    public int getCurrentStep();
    // returns # of current step

    public int getEndStep();
    // returns # of the step which is the DONE case

    public State getState();
    // checks state and returns it.
    // multiple calls will always be identical,
    // except WAITING, which can transition to READY or DONE.

    public State executeStep();
    // if READY, executes next step and returns getState();
    // otherwise, returns getState().
}
I need to write a single-threaded scheduler that calls executeStep() on the "next" object. My problem is, I'm not sure exactly what technique I should use to determine what the "next" object is. I want it to be fair (first-come, first-serve for objects not in the WAITING state).
My gut call is to have 3 FIFOs: READY, WAITING and DONE. In the beginning all objects are placed in the READY queue, and the scheduler repeats a loop where it takes the first object off the READY queue, calls executeStep(), and places it onto the queue appropriate to the result of executeStep(). Except that items in the WAITING queue need to be put into the READY or DONE queue when their state changes.... argh!
Any advice?
If this has to be single-threaded, you can use a single FIFO queue for the ready and waiting objects and use your thread to process each object as it comes out. If its state after executeStep() is still READY or WAITING, simply stick it back into the queue and it will be reprocessed.
Something like (pseudocode):
var item = queue.getNextItem();
var state = item.executeStep();
if (state == DONE)
    doneList.add(item);     // finished all steps: out of the rotation
else
    queue.addItem(item);    // READY or WAITING: re-queue for another pass
Depending on how long executeStep takes to run, you may need to introduce a delay (a sleep, not a busy loop) to prevent tight polling. Ideally you would have the objects publish state-change events and do away with the polling altogether.
This is the kind of timeslicing approach that was commonplace in hardware and comms software before multithreading was widespread.
You don't have any way for the task object to notify you when it changes from WAITING to READY except by polling it, so the WAITING and READY queues could really just be one. You can just loop around it, calling executeStep() on each object in turn. If executeStep() returns DONE, you remove the object from that queue, stick it on the DONE queue and forget about it.
If you wanted to give READY objects priority, and run through all READY objects before spending any resources polling WAITING ones, you could maintain the 3 queues like you said and only process the WAITING queue when the READY queue is empty.
I personally would spend some effort to eliminate the polling of the state, and instead define an interface that the object could use to notify your scheduler when a state changes.
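For concreteness, here is a minimal sketch of that single-queue loop, written against the TaskObject interface from the question (the Scheduler class and the done queue are my own names, nothing standard):

import java.util.ArrayDeque;
import java.util.Queue;

public class Scheduler
{
    private final Queue<TaskObject> active = new ArrayDeque<>(); // READY + WAITING together
    private final Queue<TaskObject> done = new ArrayDeque<>();

    public void add(TaskObject t) { active.add(t); }

    // Runs until every object reaches DONE. FIFO order gives
    // first-come, first-served fairness among non-WAITING objects.
    public void run()
    {
        while (!active.isEmpty())
        {
            TaskObject t = active.remove();
            if (t.executeStep() == TaskObject.State.DONE)
                done.add(t);     // finished: never polled again
            else
                active.add(t);   // READY or WAITING: back of the line
        }
    }
}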
You might want to study the design of an operating system scheduler. Check out Linux and the *BSDs, for example.
Some pointers for the Linux scheduler: Inside the Linux scheduler and Understanding the Linux Kernel
NOTE - this does not address your question of how to schedule, but I would use a separate state class that defines the states and transitions. The objects should not know which states they go through; they can be informed of which "step" they are at, etc. There are some patterns for that as well.
You should read up a little on operating systems - specifically the scheduler. Your example is a scaled down set of that problem and if you copy the relevant parts it should work great for you.
You can then add priority, etc.
The simplest technique that satisfies the requirements in your question is to repeatedly iterate over all TaskObjects calling executeStep() on each one.
This requires only one construct to hold the TaskObjects, and it can be any iterable structure, e.g. an array.
Since a TaskObject can transition from WAITING to READY asynchronously, you have to poll every TaskObject that you don't know is DONE.
The performance gained from not polling the DONE TaskObjects may be negligible. It depends on the processing load of calling executeStep() on a DONE TaskObject, which should be small.
A simple round-robin polling assures that once a READY TaskObject has executed a step, it will not execute another step until all other TaskObjects have had a chance to execute.
One obvious additional requirement is detecting when all TaskObjects are in the DONE state so you can stop processing.
To avoid polling DONE TaskObjects you will need to either maintain a flag for each one, or chain the TaskObjects in two queues: READY/WAITING and DONE.
If you store the TaskObjects in an array, make it an array of records, with members DoneFlag and TaskObject.
If for some reason you are storing the TaskObjects in a queue, with available enqueue() and dequeue() methods, then the overhead of two queues instead of one may be small.
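A rough sketch of that array-of-records round-robin, again against the question's TaskObject interface (the Slot record and the 1 ms back-off are my assumptions):

import java.util.ArrayList;
import java.util.List;

public class RoundRobinScheduler
{
    // One record per task: the task plus a DoneFlag so DONE tasks are skipped.
    private static class Slot
    {
        final TaskObject task;
        boolean done;
        Slot(TaskObject task) { this.task = task; }
    }

    private final List<Slot> slots = new ArrayList<>();

    public void add(TaskObject t) { slots.add(new Slot(t)); }

    public void run() throws InterruptedException
    {
        int remaining = slots.size();
        while (remaining > 0)                        // stop once all are DONE
        {
            boolean anyReady = false;
            for (Slot s : slots)
            {
                if (s.done) continue;                // never poll finished tasks
                TaskObject.State st = s.task.executeStep();
                if (st == TaskObject.State.DONE)
                {
                    s.done = true;
                    remaining--;
                }
                else if (st == TaskObject.State.READY)
                    anyReady = true;
            }
            if (!anyReady && remaining > 0)
                Thread.sleep(1);                     // everything WAITING: avoid a hot loop
        }
    }
}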
-Al.
Take a look at this link:
Boost state machines vs uml
Boost has state machines. Why reinvent?
I am attempting to shut down a long-running function when it takes too long. Maybe this just treats the symptoms rather than the cause, but in any case, for my situation it didn't really work out. I did it like this:
func foo(abort <-chan struct{}) {
    for {
        select {
        case <-abort:
            return
        default:
            // long-running code
        }
    }
}
In a separate function, I close the passed channel after some time, and it does get closed; if I cut out the body, the function returns. However, if there is some long-running code in the default case, closing the channel has no effect: the work simply continues as if nothing had happened.
I am pretty new to Go, but it feels like this should work, yet it does not. Is there anything I am missing? After all, router frameworks have timeout functionality, after which whatever is running is terminated. So maybe this is just curiosity, but I would really like to know how to do it.
Your code only checks whether the channel was closed once per iteration, before executing the long-running code. There's no opportunity to check the abort chan after the long-running code starts, so it will run to completion.
You need to occasionally check whether to exit early in the body of the long-running code, and this is more idiomatically accomplished using context.Context and WithTimeout, for example: https://pkg.go.dev/context#example-WithTimeout
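A minimal, self-contained sketch of that idea; the chunk count and the sleep standing in for real work are placeholders:

package main

import (
    "context"
    "fmt"
    "time"
)

// doWork checks ctx between chunks, so cancellation can interrupt it.
func doWork(ctx context.Context) error {
    for i := 0; i < 1000; i++ {
        select {
        case <-ctx.Done():
            return ctx.Err() // context.DeadlineExceeded on timeout
        default:
        }
        time.Sleep(10 * time.Millisecond) // stand-in for one chunk of real work
    }
    return nil
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
    defer cancel()
    fmt.Println(doWork(ctx)) // prints "context deadline exceeded"
}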
In your "long running code" you have to periodically check that abort channel.
The usual approach to implement that "periodically" is to split the code into chunks each of which completes in a reasonably short time frame (given that the system the process runs on is not overloaded).
After executing each such chunk you check whether the termination condition holds and then terminate execution if it is.
The idiomatic approach to perform such a check is "select with default":
select {
case <-channel:
    // terminate processing
default:
}
Here, the default no-op branch is immediately taken if channel is not ready to be received from (or closed).
Some algorithms make such chunking easier because they employ a loop where each iteration takes roughly the same time to execute.
If your algorithm is not like this, you'd have to chunk it manually; in this case, it's best to create a separate function (or a method) for each chunk.
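For instance (chunk1 through chunk3 are hypothetical names for the manually carved pieces):

func chunk1() { /* first piece of the work */ }
func chunk2() { /* second piece */ }
func chunk3() { /* third piece */ }

func longRunning(abort <-chan struct{}) {
    for _, chunk := range []func(){chunk1, chunk2, chunk3} {
        select {
        case <-abort:
            return // bail out between chunks
        default:
        }
        chunk()
    }
}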
Further points.
Consider using contexts: they provide a useful framework to solve the style of problems like the one you're solving.
Better yet, the fact that contexts can "inherit" from one another allows one to easily implement two neat things:
You can combine various ways to cancel contexts: say, it's possible to create a context which is cancelled either when some timeout passes or explicitly by some other code.
They make it possible to create "cancellation trees": cancelling the root context propagates the signal to all inheriting contexts, making them cancel whatever their goroutines are doing.
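For example, here is a tiny self-contained sketch of both points; the 5-second timeout is arbitrary:

package main

import (
    "context"
    "fmt"
    "time"
)

func main() {
    // The root can be cancelled explicitly; the child is additionally
    // bound by a timeout, combining the two cancellation conditions.
    root, cancelRoot := context.WithCancel(context.Background())
    child, cancelChild := context.WithTimeout(root, 5*time.Second)
    defer cancelChild()

    // Cancelling the root propagates to the child (and anything derived from it).
    cancelRoot()

    <-child.Done()           // returns immediately, not after 5 seconds
    fmt.Println(child.Err()) // context.Canceled
}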
Sometimes, when people say "long-running code", they do not mean code actually crunching numbers on a CPU all that time, but rather code which performs requests to slow entities, such as databases or HTTP servers, in which case the code is not actually running but sleeping on I/O, waiting for data to arrive.
If this is your case, note that all well-written Go packages (including everything in the Go standard library that deals with networked services) accept contexts in those API functions which actually make calls to such slow entities. This means that if you make your function accept a context, you can (and actually should) pass that context down the call stack where applicable, so that all the code you call can be cancelled in the same way as yours.
Further reading:
https://go.dev/blog/pipelines
https://blog.golang.org/advanced-go-concurrency-patterns
I have to create a library that communicates with a device via a COM port.
In one of the functions, I need to issue a command, then wait while the device performs a test (the duration varies from 10 to 1000 seconds), and return the result of the test:
One approach is to use async-await pattern:
public async Task<decimal> TaskMeasurementAsync(CancellationToken ctx = default)
{
    PerformTheTest();

    // Wait till the test is finished
    await Task.Delay(_duration, ctx);

    return ReadTheResult();
}
The other approach that comes to mind is to just fire an event upon completion.
The device performs a test, and the duration is specified prior to performing it. So in either case I would have to use Task.Delay() or Thread.Sleep() in order to wait for the completion of the task on the device.
I lean towards async-await, as it is easy to build in cancellation and, for lack of a better term, it is self-contained, i.e. I don't have to declare an event, create an EventArgs class, etc.
Would appreciate any feedback on which approach is better if someone has come across a similar dilemma.
Thank you.
There are several tools available for how to structure your code.
Events are a push model (so is System.Reactive, a.k.a. "LINQ over events"). The idea is that you subscribe to the event, and then your handler is invoked zero or more times.
Tasks are a pull model. The idea is that you start some operation, and the Task will let you know when it completes. One drawback to tasks is that they only represent a single result.
The coming-soon async streams are also a pull model - one that works for multiple results.
In your case, you are starting an operation (the test), waiting for it to complete, and then reading the result. This sounds very much like a pull model would be appropriate here, so I recommend Task<T> over events/Rx.
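For illustration, consuming the Task-based version with a timeout-style cancellation could look like this; Device here is a stub standing in for your real class:

using System;
using System.Threading;
using System.Threading.Tasks;

class Device
{
    // Stand-in for the real implementation from the question.
    public async Task<decimal> TaskMeasurementAsync(CancellationToken ctx = default)
    {
        await Task.Delay(TimeSpan.FromSeconds(10), ctx); // simulated test duration
        return 42.0m;
    }
}

class Program
{
    static async Task Main()
    {
        var device = new Device();
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30)); // give up after 30 s
        try
        {
            decimal result = await device.TaskMeasurementAsync(cts.Token);
            Console.WriteLine($"Measurement: {result}");
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("Measurement cancelled or timed out.");
        }
    }
}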
Modbus is a request-and-response type of serial communication: the master sends out a request and one of the slaves responds.
I am modifying the code on a microcontroller which is a master unit on a Modbus network. This unit also has a small dot-matrix LCD and some buttons for the user interface. The microcontroller runs at 16 MHz.
The problem is that after the master unit sends out a request, it does not know when the slave will respond, so it may need to wait a relatively long time. However, as this unit has buttons and an LCD, it cannot block in one place for too long, because the user will feel lag when pressing a button. The original code uses an RTOS. It separates the user-interface task from the serial-communication tasks, so it has no such problem. Now I need to change it to non-RTOS code. I have implemented a system tick timer which interrupts every 1 ms. What is the proper (or common) way to do this?
It is possible to do quite a lot with just a single task, especially if you have interrupts. The intermediate position between a single very simple task and an RTOS is a cyclic executive. See http://www3.nd.edu/~cpoellab/teaching/cse40463/slides10.pdf for a brief overview of the spectrum of functionality from a cyclic executive up to a fully preemptive multitasking operating system. You will find much more if you search on this phrase and related phrases, including very sophisticated schemes for making sure that the system never misses its deadlines. If you are an aircraft flight control system, forgetting to check the aircraft pitch angle every X ms can cause problems elsewhere :-)
One way to rewrite code which is naturally multi-threaded is to maintain a model of the state of the system, such as a collection of objects each representing a Modbus connection, indexed by a connection id. Then write a routine for every sort of event that can happen, including the arrival of a clock interrupt. When an event happens, these routines typically work out which connection is involved, retrieve it from the main collection (or create it from scratch and enter it there if necessary), do the work associated with that particular sort of event, and then return.
It is often convenient to keep a queue of future events, indexed by time, and to have a routine that creates an object representing something to be done at some future time (such as calling a method to check for the expiration of a timeout) and puts this object on the queue.
You need to worry about interrupt processing getting called halfway through an event service routine. One way to deal with this is to lock out interrupts when that could cause a problem. Another way is to have the interrupt routine do nothing more than put an object on a queue that something else will check for later, or just set a flag. Then you need only lock out interrupts when you are checking for items on the queue and removing them.
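A hypothetical sketch of that shape in C; every peripheral routine here is a placeholder stub, not a real HAL:

#include <stdbool.h>
#include <stdint.h>

static volatile uint32_t g_ticks;   /* incremented by the 1 ms tick ISR */
static volatile bool g_rx_ready;    /* set by the UART RX interrupt     */

/* ISRs only set flags; the main loop does the actual work. These would
   be registered in the vector table on real hardware. */
void tick_isr(void)    { g_ticks++; }
void uart_rx_isr(void) { g_rx_ready = true; }

/* Placeholder stubs standing in for the real routines. */
static void scan_buttons_and_update_lcd(void) {}
static void modbus_handle_rx(void)            {}
static void modbus_check_timeouts(void)       {}

int main(void)
{
    uint32_t next_modbus_poll = 0;

    for (;;) {
        /* UI first, every pass, so button presses never feel laggy. */
        scan_buttons_and_update_lcd();

        /* Advance the Modbus state machine when a byte has arrived. */
        if (g_rx_ready) {
            g_rx_ready = false;
            modbus_handle_rx();
        }

        /* Time-based work: issue requests, expire response timeouts.
           On an 8/16-bit MCU, guard the multi-byte read of g_ticks by
           briefly locking out interrupts, as discussed above. */
        if (g_ticks >= next_modbus_poll) {
            next_modbus_poll = g_ticks + 100;   /* every 100 ms */
            modbus_check_timeouts();
        }
    }
}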
A number of communications protocols are implemented in this way. Even in a true multitasking operating system you very often don't want to create a new thread every time you need a new connection. The two main problems with this approach are that the code is less clear than code with a thread per object, because stuff that naturally goes together gets chopped up into loads of event-service routines, and that if any of the event-service methods burns significant amounts of CPU, the system stalls, because nothing else happens while it runs.
Suppose you have an object 'A' that can potentially receive the following events from external objects:
Event 1
Event 2
...
Event n
Now suppose that the framework that hosts 'A' is such that all relevant events will be delivered to 'A' (one at a time), and then A::doEval() will be called.
It's important to note that 'A' could receive any combination of events in any order. 'A' might only get one event before doEval() is called, or it might get 5 events before doEval() is called. There's no way to know ahead of time.
It's also important to note that these events, because they are all delivered to 'A' before A::doEval() is called, should be considered simultaneous events. A regular state machine would react to each event as it was handed to 'A'. This would be incorrect in my use case... I need 'A' to sit back and collect all events, and only in doEval() should 'A' perform any actions.
Now here's the tricky bit: the doEval() logic needs to realize that only a subset of events occurred, but that it might need to factor them all in. For example, the code (this is ugly and what I'm trying to avoid) might look like this:
doEval()
if(Event 1 occurred && Event 2 occurred) then <do something>
It's that 'if' statement... I only want to perform the action if both events occurred, but I don't want to have that 'if' statement. This is what FSMs are supposed to get rid of, right? Do I need a hierarchy of state machines?
Any ideas on the "proper" way to address this? Any links or papers to read would be great, code is even better.
Thanks!
Make a queue to gather all the events directed at "A", then pop them from the queue and process them.
I found what I was looking for in the form of Harel State Machines:
http://www.mathworks.com/videos/understanding-state-machines-harel-state-machines-4-of-4-90491.html
TL/DR: Take what we all know as state machines and add the ability to have hierarchical substates, parallel state machines, and communication between independent state machines (which he calls broadcasting).
The whole job of a state machine is to capture the relevant history of events in a succinct way. A state machine exactly allows you to avoid the checks of the kind:
doEval()
if(Event 1 occurred && Event 2 occurred) then <do something>
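For concreteness, here is a minimal sketch (in Java, since the question is language-agnostic; all names are made up) of folding the event history into the state so the conjunction check disappears:

public class A
{
    // The state itself records which events have been seen, so doEval()
    // never tests event flags with boolean conjunctions.
    private enum State { NONE, GOT_E1, GOT_E2, GOT_BOTH }
    private State state = State.NONE;

    public void onEvent1()
    {
        state = (state == State.GOT_E2 || state == State.GOT_BOTH)
                ? State.GOT_BOTH : State.GOT_E1;
    }

    public void onEvent2()
    {
        state = (state == State.GOT_E1 || state == State.GOT_BOTH)
                ? State.GOT_BOTH : State.GOT_E2;
    }

    public void doEval()
    {
        switch (state)
        {
            case GOT_BOTH:
                doSomething();  // fires only when both events arrived before doEval()
                break;
            default:
                break;          // other combinations: other actions, or nothing
        }
        state = State.NONE;     // the evaluation consumes the collected events
    }

    private void doSomething() { /* the action */ }
}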
The Mathworks video is a good starting point, but I would also recommend the following sources:
UML state machine (Wikipedia)
A crash course in UML state machines (Embedded.com article)
Back to Basics (CUJ article)
I already have a few ideas, but I'd like to hear some differing opinions and alternatives from everyone if possible.
I have a Windows console app that uses Exchange web services to connect to Exchange and download e-mail messages. The goal is to take each individual message object, extract metadata, parse attachments, etc. The app is checking the inbox every 60 seconds. I have no problems connecting to the inbox and getting the message objects. This is all good.
Here's where I am accepting input from you: When I get a message object, I immediately want to process the message and do all of the busy work explained above. I was considering a few different approaches to this:
Queuing the e-mail objects up in a table and processing them one-by-one.
Passing the e-mail object off to a local Windows service to do the busy work.
I don't think db queuing would be a good approach because, at times, multiple e-mail objects need to be processed. It's not fair if a low-priority e-mail with 30 attachments is processed before a high-priority e-mail with 5 attachments is processed. In other words, e-mails lower in the stack shouldn't need to wait in line to be processed. It's like waiting in line at the store with a single register for the bonehead in front of you to scan 100 items. It's just not fair. Same concept for my e-mail objects.
I'm somewhat unsure about the Windows service approach. However, I'm pretty confident that I could have an installed service listening, waiting on demand for an instruction to process a new e-mail. If I have 5 separate e-mail objects, can I make 5 separate calls to the Windows service and process without collisions?
I'm open to suggestions or alternative approaches. However, the solution must be presented using .NET technology stack.
One option is to do the processing in the console application. What you have looks like a standard producer-consumer problem with one producer (the thread that gets the emails) and multiple consumers. This is easily handled with BlockingCollection.
I'll assume that your message type (what you get from the mail server) is called MailMessage.
So you create a BlockingCollection<MailMessage> at class scope. I'll also assume that you have a timer that ticks every 60 seconds to gather messages and enqueue them:
private BlockingCollection<MailMessage> MailMessageQueue =
    new BlockingCollection<MailMessage>();

// Timer is created as a one-shot and re-initialized at each tick.
// This prevents the timer proc from being re-entered if it takes
// longer than 60 seconds to run.
System.Threading.Timer ProducerTimer = new System.Threading.Timer(
    TimerProc, null, TimeSpan.FromSeconds(60), TimeSpan.FromMilliseconds(-1));

void TimerProc(object state)
{
    var newMessages = GetMessagesFromServer();
    foreach (var msg in newMessages)
    {
        MailMessageQueue.Add(msg);
    }
    ProducerTimer.Change(TimeSpan.FromSeconds(60), TimeSpan.FromMilliseconds(-1));
}
Your consumer threads just read the queue:
void MessageProcessor()
{
    foreach (var msg in MailMessageQueue.GetConsumingEnumerable())
    {
        ProcessMessage(msg);
    }
}
The timer will cause the producer to run once per minute. To start the consumers (say you want two of them):
var t1 = Task.Factory.StartNew(MessageProcessor, TaskCreationOptions.LongRunning);
var t2 = Task.Factory.StartNew(MessageProcessor, TaskCreationOptions.LongRunning);
So you'll have two threads processing messages.
It makes no sense to have more processing threads than you have available CPU cores. The producer thread presumably won't require a lot of CPU resources, so you don't have to dedicate a thread to it. It'll just slow down message processing briefly whenever it's doing its thing.
I've skipped over some detail in the description above, particularly cancellation of the threads. When you want to stop the program, but let the consumers finish processing messages, just kill the producer timer and set the queue as complete for adding:
MailMessageQueue.CompleteAdding();
The consumers will empty the queue and exit. You'll of course want to wait for the tasks to complete (see Task.Wait).
If you want the ability to kill the consumers without emptying the queue, you'll need to look into Cancellation.
The default backing store for BlockingCollection is a ConcurrentQueue, which is a strict FIFO. If you want to prioritize things, you'll need to come up with a concurrent priority queue that implements the IProducerConsumerCollection interface. .NET doesn't have such a thing (or even a priority queue class), but a simple binary heap that uses locks to prevent concurrent access would suffice in your situation; you're not talking about hitting this thing very hard.
Of course you'd need some way to prioritize the messages. Probably sort by number of attachments so that messages with no attachments are processed quicker. Another option would be to have two separate queues: one for messages with 0 or 1 attachments, and a separate queue for those with lots of attachments. You could have one of your consumers dedicated to the 0 or 1 queue so that easy messages always have a good chance of being processed first, and the other consumers take from the 0 or 1 queue unless it's empty, and then take from the other queue. It would make your consumers a little more complicated, but not hugely so.
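A rough sketch of that two-queue consumer, assuming both queues are BlockingCollection<MailMessage> (requires System.Collections.Concurrent and System.Threading):

// easyQueue: messages with 0 or 1 attachments; bigQueue: everything else.
void PriorityMessageProcessor(
    BlockingCollection<MailMessage> easyQueue,
    BlockingCollection<MailMessage> bigQueue,
    CancellationToken token)
{
    var both = new[] { easyQueue, bigQueue };
    while (!token.IsCancellationRequested)
    {
        // Prefer the easy queue, but don't block on it...
        if (!easyQueue.TryTake(out MailMessage msg))
        {
            // ...otherwise block until either queue yields an item.
            // A pending cancel surfaces here as OperationCanceledException.
            BlockingCollection<MailMessage>.TakeFromAny(both, out msg, token);
        }
        ProcessMessage(msg);
    }
}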
If you choose to move the message processing to a separate program, you'll need some way to persist the data from the producer to the consumer. There are many possible ways to do that, but I just don't see the advantage of it.
I'm somewhat of a novice here, but it seems like an initial approach could be to have a separate high-priority queue. Every time a worker is available to obtain a new message, it could do something like:
If lowPriorityQueue.Count > 0 AndAlso DateTime.Now - lowPriorityQueue.Peek.AddedTime > maxWaitTime Then
    ' Oldest low-priority message has waited too long: process it to avoid starvation.
    ProcessMessage(lowPriorityQueue.Dequeue())
ElseIf highPriorityQueue.Count > 0 Then
    ProcessMessage(highPriorityQueue.Dequeue())
ElseIf lowPriorityQueue.Count > 0 Then
    ProcessMessage(lowPriorityQueue.Dequeue())
End If
In a single thread, while one message can still block the others, higher-priority messages would be processed sooner.
Depending on how fast most messages get processed, the application could create a new worker on a new thread if the queues are getting too big or too old.
Please tell me if I'm completely off-base here though.