How would you implement a Workflow system? - algorithm

I need to implement a Workflow system.
For example, to export some data, I need to:
Use an XSLT processor to transform an XML file
Use the resulting transformation to convert the data into an arbitrary data structure
Use the result (file or data) to generate an archive
Move the archive into a given folder.
I started by creating two types of class: Workflow, which is responsible for adding new Step objects and running them.
Each Step implements a StepInterface.
My main concern is that all my steps depend on the previous one (except the first), and I'm wondering what would be the best way to handle such a problem.
I thought of looping over the steps and providing each one the result of the previous (if any), but I'm not really happy with it.
Another idea would have been to allow a "previous" Step to be set on a Step, like:
$s = new Step();
$s->setPreviousStep($previousStep);
But then I lose the utility of a Workflow class.
Any ideas or advice?
By the way, I'm also concerned about the success or failure of the whole workflow: if any step fails, I need to roll back or clean up the data from the previous steps.

I implemented a similar workflow engine last year (closed source though, so no code that I can share). Here are a few ideas based on that experience:
StepInterface - can do what you're doing right now: abstract a single step.
Additionally, provide a rollback capability, but I think a step should know when it fails and clean up before proceeding further. An abstract step can handle this for you (template method).
You might want to consider branching based on the StepResult - so you could do a StepMatcher that takes a StepResult object and a conditional - its sub-steps are executed only if the conditional returns true.
You could also do a StepException to handle exceptional flows if a step errors out. Ideally, this is something that you can define either at a workflow level (do this if any step fails) and/or at a step level.
I took the approach that a step returns a well-defined structure (StepResult) that's available to the next step. If there's bulky data (say a large file, etc.), then the URI/locator for the resource is passed in the StepResult.
Your workflow is going to need a context to work with - in the example you quote, this would be the name of the file, the location of the archive, and so on - so think of a WorkflowContext.
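To make this concrete, here is a minimal sketch in Python of the pieces described above (StepResult, the template-method step, and WorkflowContext are the names from this answer; the method names and fields are assumptions of mine, not a definitive design):

from abc import ABC, abstractmethod

class StepResult:
    # Well-defined structure handed to the next step; bulky data
    # (say a large file) is referenced by URI instead of embedded.
    def __init__(self, success, data=None, resource_uri=None):
        self.success = success
        self.data = data
        self.resource_uri = resource_uri

class AbstractStep(ABC):
    # Template method: run() wraps do_execute() so a failing step
    # knows it failed and cleans up before the error propagates.
    def run(self, context, previous_result=None):
        try:
            return self.do_execute(context, previous_result)
        except Exception:
            self.rollback(context)  # per-step cleanup
            raise

    @abstractmethod
    def do_execute(self, context, previous_result):
        ...  # return a StepResult for the next step

    def rollback(self, context):
        pass  # overridden by steps that create files, archives, etc.

class WorkflowContext:
    # Shared data the steps work with: file name, archive location, etc.
    def __init__(self, **values):
        self.values = dict(values)

class Workflow:
    def __init__(self, context):
        self.context = context
        self.steps = []

    def add_step(self, step):
        self.steps.append(step)
        return self

    def run(self):
        result = None
        for step in self.steps:
            result = step.run(self.context, result)  # thread results through
        return result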
Additional thoughts
You might want to consider the following too - if this is something that you're planning to implement as a large-scale service/server:
Steps could live in libraries that are dynamically loaded.
Workflow definitions in an XML/JSON file - again, dynamically reloaded when edited (see the loading sketch after this list).
Remote invocation and callback - submit a job to a remote service with a callback API. When the remote service calls back, workflow execution is picked up at the subsequent step in the flow.
Parallel execution where possible, etc.
Stateless design.
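For the XML/JSON definition idea, a loading sketch (this reuses the Workflow class from the sketch above; both the file format and the registry argument are illustrative assumptions):

import json

def load_workflow(path, context, registry):
    # Definition file assumed to look like:
    # {"steps": [{"type": "xslt_transform", "params": {...}}, ...]}
    # `registry` maps those type names to concrete step classes.
    with open(path) as f:
        definition = json.load(f)
    workflow = Workflow(context)
    for spec in definition["steps"]:
        step_cls = registry[spec["type"]]
        workflow.add_step(step_cls(**spec.get("params", {})))
    return workflow

Dynamic reloading is then a matter of watching the file's modification time and rebuilding the workflow when it changes.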

Rolling back fits into this structure easily, as each Step will implement its own rollback() method, which the workflow can call (preferably in reverse order) if any of the steps fail.
As for the main question, it really depends on how sophisticated you want to get. On a basic level, you can define a StepResult interface, which is returned by each step and passed on to the next one. The obvious problem with this approach is that each step has to "know" which implementation of StepResult to expect. For small systems this may be acceptable; for larger systems you'd probably need some kind of configurable mapping framework that can be told how to convert the result of the previous step into the input of the next one. So Workflow calls Step, Step returns StepResult, Workflow then calls StepResultConverter (which is your configurable mapping thingy), StepResultConverter returns a StepInput, Workflow then calls the next Step with the StepInput, and so on.
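A sketch of that calling sequence, folding in the reverse-order rollback from above (keying converters by step position is an assumption; a real mapping framework would be configurable in some richer way):

class Workflow:
    def __init__(self, steps, converters=None):
        self.steps = steps
        self.converters = converters or {}  # step index -> StepResultConverter

    def run(self, context):
        completed = []
        step_input = None
        try:
            for i, step in enumerate(self.steps):
                result = step.run(context, step_input)  # returns a StepResult
                completed.append(step)
                convert = self.converters.get(i)
                # The converter turns this StepResult into the next
                # step's StepInput; default is to pass it through as-is.
                step_input = convert(result) if convert else result
            return step_input
        except Exception:
            for step in reversed(completed):  # roll back in reverse order
                step.rollback(context)
            raise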

I've had great success implementing workflow using a finite state machine. It can be as simple or complicated as you like, with multiple workflows linking to each other. Generally an FSM can be implemented as a simple table, where the current state of a given object is tracked in a history table: keep a journal of the transitions on the object, and the last entry is its current state. So a transition would be of the form:
nextState = TransLookup(currState, Event, [Condition])
If you are implementing a front end you can use this transition information to construct a list of the events available to a given object in its current state.
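A minimal sketch of that table-driven approach (the states, events and conditions here are illustrative):

# (current state, event) -> (optional condition, next state)
TRANSITIONS = {
    ("draft", "submit"): (lambda obj: obj["complete"], "in_review"),
    ("in_review", "approve"): (None, "approved"),
    ("in_review", "reject"): (None, "draft"),
}

history = []  # journal of transitions; the last entry gives the current state

def trans_lookup(curr_state, event, obj):
    condition, next_state = TRANSITIONS[(curr_state, event)]
    if condition and not condition(obj):
        raise ValueError("condition not met for %s/%s" % (curr_state, event))
    history.append((curr_state, event, next_state))  # journal the transition
    return next_state

def available_events(curr_state):
    # What a front end would offer for an object in its current state.
    return [event for (state, event) in TRANSITIONS if state == curr_state]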

Related

How to apply machine learning for streaming data in Apache NIFI

I have a processor that generates time-series data in JSON format. Based on the received data, I need to make a forecast using machine learning algorithms in Python, then write the new forecast values to another flowfile.
The problem is that when you run such a Python script, it must perform many heavy preprocessing operations: queries to a database, creating a complex data structure, initializing forecasting models, etc.
If you use ExecuteStreamCommand, the script will be run once for every flowfile. Is this true?
Can I make a Python script in NiFi that starts once and receives flowfiles many times, storing the history of previously received data? Or do I need to make an HTTP service that will receive data from NiFi?
You have a few options:
Build a custom processor. This is my suggested approach. The code would need to be in Java (or Groovy, which provides a more Python-like experience) but would not have Python dependencies, etc. However, I have seen examples of this approach for ML model application (see Tim Spann's examples) and it is generally very effective. The initialization and individual flowfile-trigger logic are cleanly separated, and performance is good.
Use InvokeScriptedProcessor. This will allow you to write the code in Python and separate the initialization (pre-processing, DB connections, etc. - onScheduled in NiFi processor parlance) from the execution phase (onTrigger). Some examples exist, but I have not personally pursued this with Python specifically. You can use Python dependencies but not "native modules" (i.e. compiled C code), as the execution engine is still Jython.
Use ExecuteStreamCommand. Not strongly recommended. As you mention, every invocation would require the preprocessing steps to occur, unless you designed your external application in such a way that it ran a long-lived "server" component and each ESC command sent data to it and returned an individual response. I don't know what your existing Python application looks like, but this would likely involve complicated changes. Tim has another example using CDSW to host and deploy the model and NiFi to send it data via HTTP to evaluate.
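For the long-lived "server" variant of option 3, here is a minimal sketch using only the Python standard library; the model-loading function and the endpoint are placeholders of mine, and on the NiFi side you could post each flowfile's content to it (e.g. with InvokeHTTP):

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def load_forecast_model():
    # Stand-in for the heavy one-time init: DB queries, data structures,
    # forecasting model setup, etc.
    class Model:
        def predict(self, series):
            return {"forecast": series[-3:]}  # dummy placeholder forecast
    return Model()

MODEL = load_forecast_model()  # runs once, at server start-up

class ForecastHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        series = json.loads(self.rfile.read(length))
        body = json.dumps(MODEL.predict(series)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8081), ForecastHandler).serve_forever()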
Making a custom processor that can do this is another option, and Java is more appropriate for it. I believe you can do pretty much everything with Java; you just need to find the libraries. Yes, there might be some issues with initialization and preprocessing, but those can be handled in the processor's init function in NiFi, which allows you to preserve the state of certain components.
In my use case I had to build a custom processor that could take in images and count the number of people in each image. For that, I had to load a deep learning model once in the init method; afterwards, the onTrigger method could take a reference to that model every time it processed an image.

SSIS Get List of all OLE DB Destinations in Data Flow

I have several SSIS packages that we use to load data from multiple different OLE DB data sources into our DB. Inside each package we have several Data Flow tasks that hold a large number of OLE DB Sources and Destinations. What I'm looking for is a way to get a text output that holds all of the Destination flow configurations (Sources would be good too, but they're not at the top of my list).
I'm trying to make sure that all my OLE DB Destination flows are pointed at the right table, as I've found a few hiccups, without having to double-click on each Flow task and check that way; it just becomes tedious and is still prone to missing things.
I'm viewing the packages in Visual Studio 2013. Any help is appreciated!
I am not aware of any programmatic way to discover this data, other than building an application to read the XML within the *.dtsx package. Best advice: pack a lunch and have at it. I know for sure that there is nothing with respect to viewing and setting database tables (only server connections).
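If you do go down the read-the-XML road, a rough Python sketch of the idea (the element and attribute names below match recent SSIS package formats, but treat them as assumptions to verify against your own *.dtsx files):

import xml.etree.ElementTree as ET

def list_oledb_destinations(dtsx_path):
    # Scan pipeline components and print each OLE DB Destination
    # with the table it writes to (the OpenRowset property).
    tree = ET.parse(dtsx_path)
    for component in tree.iter():
        if not component.tag.endswith("component"):
            continue
        class_id = component.get("componentClassID", "")
        if "OLEDBDestination" not in class_id:
            continue
        table = None
        for prop in component.iter():
            if prop.tag.endswith("property") and prop.get("name") == "OpenRowset":
                table = prop.text
        print("%s -> %s" % (component.get("name"), table))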
Though, a solution I may add once you have determined the list: create a variable (or variables) to store the unique connection strings, and then set those connection strings inside the source/destination components. This will make them easier to manage going forward. In fact, you can take it one step further by setting the same values as parameters, as opposed to variables, which has the added benefit of being exposed on the server. This allows either you or the DBA to set the values as you promote through environments or change server nodes.
Also, I recommend splitting that solution into smaller solutions if possible. In my opinion, there is nothing worse than one giant solution that tries to do it all. I am not sure if any of this is helpful, but for what it's worth, I do hope it helps.
You can use the SSIS Object Model for your needs. An example can be found here. Look in the method IterateAllDestinationComponentnsInPackage for the exact details. To start understanding the code, begin in the Start method and follow the path.
Caveats: make sure you use the appropriate Monikers and Class IDs for the Data Flow Tasks and your Destination Components. You can also use this for other Control Flow Tasks and Data Flow Components (for example, Source Components, which seem to be your other need).

How to avoid lambda trigger recursive call

I've written a Lambda function that is triggered via an S3 bucket's putObject event. I am modifying the headers of an object after upload by downloading the object and re-uploading it with the appropriate headers. But because the function itself uses putObject to re-upload the object, the Lambda triggers itself.
Three options:
Use a different API to upload your changes than the one that you have an event on. I.e., if your Lambda is triggered by PUT, then use a POST to modify the content afterwards (tough to do, since POST isn't supported well by SDKs AFAIK, so this may not be an option).
Track usage and have a small guard at the beginning of your handler to short-circuit if the only changes made to a file are ones you made; see the sketch after this list. If you can't programmatically detect the headers you've set, you'll probably need a small Dynamo table or similar for keeping track of which files you've already touched. This will let you abort immediately and only be charged the minimum 100ms fee.
Reorganize your project to have an 'ingest' bucket and an output bucket. Unprocessed files are put into the former, modified, and then placed into the latter. This has a number of advantages. The first is that you don't end up in the current situation, so that's a plus. The second is that you don't have whatever process consumes these modified files potentially pulling an unmodified version. The third is that you get better insight into the process: if something goes wrong, it's easy to see which batches of files have undergone which process.
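A minimal sketch of the guard in option 2, using the object's own user metadata as the marker so no Dynamo table is needed (the "processed" key and the header being rewritten are illustrative choices, not a convention):

import urllib.parse
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    head = s3.head_object(Bucket=bucket, Key=key)
    if head["Metadata"].get("processed") == "true":
        return  # our own re-upload triggered us; abort immediately

    # Rewrite the headers in place, marking the copy so the trigger
    # this copy causes will bail out at the guard above.
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        MetadataDirective="REPLACE",
        ContentType="application/json",  # example of the header being fixed
        Metadata={"processed": "true"},
    )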
Overall, I'd recommend option 3 for you, though I know that in my lazier moments I might try to opt for 1 or 2.
Either way, good luck.

Data sharing in GUI, Matlab

I'm developing a GUI with pop-up windows, so it is actually a work package with multiple GUIs.
I have read through the examples given in the help files (changme and toolpalette), but I failed to imitate the method to transfer data from the new GUI back to the old one.
Here is my problem.
I have two GUIs: A, the main one, and B, which I use to collect input data; I want to transfer the data back to A.
Question 1:
I want to add a new field to the handles structure in A,
let's say
handles.newclass
How can I define its properties, e.g. 'Strings'?
Question 2:
In A, a button has the callback
B('A', handles.A);
so we activate B.fig.
After finishing the work in B,
it has collected the following data (strings and doubles) in B(!):
title_1 title_2 ... title_n
and
num_1 num_2 ... num_n
I want to pass the data back to A.
Following the instructions, I wrote the code shown below.
mainHandles = guidata(A);
title = mainHandles.title_1;
set(title,'String',title_1);
However, when I go back to A, the handles in A were not changed at all.
Please, someone help me out here.
Thank you!
=============update================
The solution I found is adding extra variables (say, handles.GUIdata) to the handles structure of one GUI; whenever the data are required, I just read them from the corresponding GUI.
It works well for me, since I have a main control panel and several sub-GUIs.
There is a short discussion of this issue here.
I have had similar issues where I wanted external batch scripts to actually control my GUI applications, but there is no reason two GUIs would not be able to do the same.
I created a Singleton object, and when the GUI application starts up it gets the reference to the Singleton controller and sets the appropriate GUI handles into the object for later use. Once the Singleton has the handles, it can use set and get functions to provide or exchange data with any GUI control whose handle it holds. Any function/callback in the system can get the handle to the Singleton and then invoke routines on it that allow data to be exchanged or even control operations to be run. Your GUI A can, for instance, ask the controller for the value in GUI B's field X, or even modify that value directly if desired. It's very flexible.
In your case, be sure to invalidate any handles if GUI A or B goes away, and test whether a GUI component actually exists before getting or modifying any values. The Singleton object will even survive across multiple invocations of your app, as long as MATLAB itself is left running, so be sure to clean up on exit if you don't want stale information lying around.
http://www.mathworks.com/matlabcentral/fileexchange/24911-design-pattern-singleton-creational
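The shape of that controller, sketched in Python since the pattern itself is language-independent (in MATLAB it would be a handle class holding a persistent instance; all names here are illustrative):

class GuiController:
    _instance = None

    @classmethod
    def instance(cls):
        # Every caller gets the same controller object.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def __init__(self):
        self.handles = {}  # GUI name -> widget handles registered at startup

    def register(self, gui_name, handles):
        self.handles[gui_name] = handles

    def get_value(self, gui_name, field):
        return self.handles[gui_name][field]  # e.g. GUI A reading B's field X

    def set_value(self, gui_name, field, value):
        self.handles[gui_name][field] = value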
Regarding Question 2, it looks like you forgot to first specify that Figure A should be active when setting the title. Fix that, and everything else looks good (at least, in the small snippets you've posted).

Designing a complex workflow diagram

We've got a surprisingly complex workflow that needs to be monitored by quasi-technical employees with an in-house webapp. There are about 30 steps, some of which are manual (editing), some are semi-automated stop points (like "the files have been received" or customer approval of certain templates), and some are completely automated (file conversion, search indexing, etc). The flowchart for all of these steps is large and complicated, and three people might be working on three completely different steps at any one time.
How would you present this vast amount of information as usefully as possible to your users? Just showing the whole diagram seems like the brute force solution. But it's big, and it'll likely get bigger as we do more things. Not to mention the complexity necessary to encode this entire diagram in HTML.
I assume you don't want to show these just for entertainment or mockery, but to help the users along the way, automate as much as possible, document the process, etc. It would probably help if you clearly defined the goals or purpose of your app.
I don't see a point in showing the entire workflow, except for "debugging the business rules" or maybe because the clients want to see it.
If your goal is to help users do their job, I would present the state the "project" (or whatever term fits better) is at, and the possible transitions to other states.
The state might be multiple, mostly independent variables: one might describe the progress of the content, e.g. "incomplete" / "complete" / "reviewed by 2nd staffer" / "signed off by 2nd staffer", while others might contain a schedule that is developed in parallel, e.g. "test print date = not scheduled", "print date = not scheduled", "final delivery = tomorrow, preferably yesterday".
A transition might be "Sent to customer for review", "mark as content-complete", "content modified", etc.
Is this what you have in mind?
I propose dividing your workflow into modules and representing the active state for each module.
A module is a subset of your main workflow. For example, it could be divided by task, person, role, department, etc. This will greatly simplify the representation of the workflow. Let's say someone is responsible for data entry at many critical moments. We can group all his tasks in one module (or sub-workflow) containing the same activities, inputs, outputs and conditions. Modules can be interdependent and related.
A state is where we are located in a module. In simple workflows there is only one active task. In real life we are multi-threaded! So in one module, many states could be active at the same time. The state also includes active inputs, outputs and memory bits.
An input is something required to perform an activity or to evaluate a boolean condition. It could be a document, a piece of data, a signal...
An output is something resulting from a task: information, a document, a signal...
Enough definitions?
Then simply convert your workflow into a LADDER LOGIC and you have your states!
See Ladder Logic definition on Wikipedia
You display only active states:
Active task(s) for the module
Inputs required / inputs confirmed
Output required / output realized
Conditions to continue
Seems abstract?
Here is a small example...
Janet enters data into the system. She manages the green tasks of the diagram. We focus only on her work, not the other tasks. She knows how to do 16 tasks in the workflow. We are waiting on the following actions from her to continue, and her intranet dashboard says:
Priority 1: You must send a PO to order enough pencils for the next month based on the sales report.
Task: Send a purchase order
Inputs: Forecast report from the marketing department
Outputs: PO, vendor, item, quantity
Condition for completion: PO sent and order confirmation received from supplier
Priority 2: You must enter into the financial system the number of erasers rejected by production
Task: Data entry
Inputs: Reject count from production
Outputs: Number of rejects
Condition for completion: data entered and confirmed
We do a lot of troubleshooting on automated production systems with hundreds of thousands of ladder steps (the workflow is too complex to be represented as a whole). When the system is blocked, we look at each module and determine which inputs are missing for task activation or completion.
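A minimal sketch of that dashboard logic (the module and task structures below are illustrative, condensed from the Janet example above):

module = {
    "name": "Janet / data entry",
    "tasks": [
        {"priority": 1, "name": "Send a purchase order",
         "inputs": {"forecast report from marketing": True},
         "done": lambda s: s.get("po_sent", False) and s.get("order_confirmed", False)},
        {"priority": 2, "name": "Enter eraser reject count",
         "inputs": {"reject count from production": False},
         "done": lambda s: s.get("rejects_entered", False)},
    ],
}

def dashboard(module, state):
    # Show only the active (incomplete) tasks of the module, by priority,
    # with the inputs still missing for activation/completion.
    for task in sorted(module["tasks"], key=lambda t: t["priority"]):
        if task["done"](state):
            continue  # completed tasks are hidden
        missing = [name for name, ok in task["inputs"].items() if not ok]
        print("Priority %d: %s" % (task["priority"], task["name"]))
        if missing:
            print("  waiting on: " + ", ".join(missing))

dashboard(module, {"po_sent": True})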
Good luck!
This sounds like the sort of application for which BPEL is suited.
Of course you don't want to re-architect your system right now. But there are a number of BPEL implementations out there, some of which include graphical editing tools. One of these might help you in your current situation, because they are good at handling scope and hiding detail. So I think you might derive benefit from drawing your workflow as a BPEL diagram even if you don't do anything else with the language.
The Wikipedia page lists several of the available implementations. In addition, Oracle's JDeveloper IDE includes a BPEL Diagrammer as part of its SOA suite; unfortunately it is no longer part of the standard install but it is still available. Find out more.
Try doing it in layers. You have the most detailed layer done; now add additional docs with the details hidden, grouped into higher-level business processes. Users should be able to safely ignore some of those details, but it's good for them to have visibility into how their part fits into the whole.
You may need more than one higher-level document.
You can use Prezi to present this information to users in a lucid manner.
Split the workflow into phases and present it so that the end user can easily identify the phase he is currently in.
Display as many phases as there are inputs. The workflow starts with 6 different inputs, so display six buttons on screen, enabling the user to select the input that he wants.
On selecting a button, zoom into the workflow to show the next steps. This would also help the user verify the actions he has done so far to reach the current state.
But this way of presenting could become cumbersome as the number of completed steps goes up. Say the user has almost reached the end of the workflow: to check the next step, he would have to go through all the previous steps, which might frustrate him.
To avoid this, you can split the complete workflow chronologically into 3-5 phases. The phases should be split logically. The ultimate aim is not to overwhelm the users with the full workflow. Personally, I would try to avoid any task involving this workflow if it were presented the way you have shown. No offense; I bet you feel the same.
I could give you a better picture if you re-posted the image with the state names replaced by numbers.
I'd recommend having the whole flow documented somewhere, but in terms of what is distributed to users, how about focusing on task-oriented flows? No one user will be responsible for the entire process, I would imagine.
For example, let's say I have 2 roles, A and B, and 6 tasks, 1 through 6, executed in order. Each task may have multiple steps but is self-contained (e.g. download the file, review, run process, review again, upload). A does the even tasks and B does the odd tasks.
A would need to know about the detailed steps that comprise tasks 2, 4, and 6, but not about what goes on in 1, 3, and 5. So hand A a detailed set of flows for the tasks he is responsible for, along with a diagram that treats each task as a black box (a small sketch of this view follows below).
If the flow can't be made modular in this way, you may want to review the process itself to see why it's so complex.
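A tiny sketch of generating that per-role view (the task data is illustrative):

tasks = [
    {"id": 1, "role": "B", "steps": ["download file", "review"]},
    {"id": 2, "role": "A", "steps": ["run process", "review again", "upload"]},
    {"id": 3, "role": "B", "steps": ["sign off"]},
]

def view_for(role):
    # Detailed steps for the viewer's own tasks; black boxes for the rest.
    for task in tasks:
        if task["role"] == role:
            print("Task %d (yours): %s" % (task["id"], " -> ".join(task["steps"])))
        else:
            print("Task %d: handled by role %s" % (task["id"], task["role"]))

view_for("A")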
How about showing an example of a workflow scenario, that is, showing the transitions in one possible pass through the workflow? You could tailor this to a specific user profile and highlight the pertinent states, dimming the others. This lets them get a clear idea of the transitions by seeing a real-life example.
