How to apply machine learning to streaming data in Apache NiFi

I have a processor that generates time series data in JSON format. Based on the received data I need to make a forecast using machine learning algorithms in Python, then write the new forecast values to another flow file.
The problem is that when you run such a Python script, it must perform a lot of heavy preprocessing: queries to a database, building a complex data structure, initializing forecasting models, etc.
If you use ExecuteStreamCommand, the script will be run again for every flow file. Is that true?
Can I write a Python script in NiFi that starts once and then receives flow files many times, keeping a history of the previously received data? Or do I need to build an HTTP service that receives data from NiFi?

You have a few options:
Build a custom processor. This is my suggested approach. The code would need to be in Java (or Groovy, which provides a more Python-like experience), but it would not carry Python dependencies, etc. I have seen examples of this approach used to apply ML models (see Tim Spann's examples), and it is generally very effective. The initialization logic and the per-flow-file trigger logic are cleanly separated, and performance is good.
Use InvokeScriptedProcessor. This will allow you to write the code in Python and separate the initialization (pre-processing, DB connections, etc., onScheduled in NiFi processor parlance) from the execution phase (onTrigger); a rough sketch follows this list. Some examples exist, but I have not personally pursued this with Python specifically. You can use pure-Python dependencies but not "native modules" (i.e. compiled C code), as the execution engine is still Jython.
Use ExecuteStreamCommand. Not strongly recommended. As you mention, every invocation would require the preprocessing steps to run, unless you designed your external application so that it ran a long-lived "server" component and each ESC invocation sent data to it and received an individual response. I don't know what your existing Python application looks like, but this would likely involve complicated changes. Tim has another example that uses CDSW to host and deploy the model and NiFi to send it data via HTTP for evaluation.
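To make the InvokeScriptedProcessor option concrete, here is a rough Jython sketch of the pattern, under the assumption that the heavy setup (DB queries, building data structures, initializing the forecast model, the history buffer) can live in the scripted class's constructor, which runs once when the script is (re)loaded, while onTrigger runs per flow file. The class and field names are made up for illustration; this is a sketch, not a drop-in implementation.

# Minimal InvokeScriptedProcessor (Jython) sketch. ISP evaluates this script
# once and keeps the 'processor' instance around, so __init__ is a natural
# place for expensive one-time setup; onTrigger then runs for each flow file.
from org.apache.nifi.processor import Processor, Relationship

class ForecastProcessor(Processor):
    def __init__(self):
        self.REL_SUCCESS = Relationship.Builder() \
            .name("success").description("Forecast emitted").build()
        # One-time, expensive initialization (hypothetical): DB lookups,
        # building data structures, loading/initializing the forecast model.
        self.history = []   # running history of received observations
        self.model = None   # placeholder for the initialized forecasting model

    # Boilerplate required by the Processor interface
    def initialize(self, context): pass
    def getRelationships(self): return set([self.REL_SUCCESS])
    def validate(self, context): pass
    def getPropertyDescriptors(self): return []
    def onPropertyModified(self, descriptor, oldValue, newValue): pass
    def getIdentifier(self): return None

    def onTrigger(self, context, sessionFactory):
        session = sessionFactory.createSession()
        try:
            flowFile = session.get()
            if flowFile is None:
                return
            # Here you would read the incoming JSON, append it to
            # self.history, run the already-initialized model, and write
            # the forecast back to the flow file content.
            session.transfer(flowFile, self.REL_SUCCESS)
            session.commit()
        except:
            session.rollback(True)
            raise

# InvokeScriptedProcessor looks for a variable named 'processor'
processor = ForecastProcessor()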

Make a custom processor that can do that. Java is more appropriate; I believe you can do pretty much everything in Java, you just need to find the right libraries. Yes, there might be some issues with initialization and preprocessing, but all of that can be handled in the processor's init method in NiFi, which allows you to preserve the state of certain components.
For example, in my use case I had to build a custom processor that could take in images and count the number of people in each image. For that, I had to load a deep learning model once in the init method; afterwards the onTrigger method could take a reference to that model every time it processed an image.

Related

GetHBase processor from 1 table Apache NiFi

gethbase >> execute_script
Hello, I have a problem with the back pressure object threshold when processing data from HBase and then executing a script with Jython. If just one processor is running, my queue is always full, because the first processor is faster than the second. I increased the concurrent tasks on the second processor from 1 to 3 or 4, but that produces a new error message, shown here:
(screenshot of the error message omitted)
Does anyone here have a solution?
This might actually increase your work a bit but I would highly recommend writing Groovy for your custom implementation as opposed to Python/Jython/JRuby.
A couple of reasons for that!
Groovy was built "for the JVM" and leverages/integrates with Java more cleanly
Jython is an implementation of Python for the JVM. There is a lot of back and forth that happens between Python and the JVM, which can substantially increase the overhead.
If you still prefer to go with Jython, there are still a couple of things that you can do!
Use InvokeScriptedProcessor (ISP) instead of ExecuteScript. ISP is faster because it only loads the script once, then invokes methods on it, unlike ExecuteScript, which evaluates the script every time.
Use ExecuteStreamCommand with command-line Python instead. You won't have the flexibility of accessing attributes, processor state, etc. but if you're just transforming content you should find ExecuteStreamCommand with Python faster.
No matter which language you choose, you can often improve performance by using session.get(int) instead of session.get(). That way, if there are a lot of flow files in the queue, you can call session.get(1000) or so and process up to 1000 flow files per execution. If your script has a lot of per-execution overhead, handling multiple flow files per execution can make a significant difference.
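As a rough illustration of that batching idea, here is an ExecuteScript-style Jython body (session and REL_SUCCESS are variables that ExecuteScript binds for the script; the actual transformation is left as a placeholder):

# Pull up to 1000 queued flow files in one invocation so the per-invocation
# script overhead is paid once for the whole batch.
flowFiles = session.get(1000)
for flowFile in flowFiles:
    # ... inspect/transform the flow file content or attributes here ...
    session.transfer(flowFile, REL_SUCCESS)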

spring state machine "dynamic fork"

I'm trying to model a business process using the Spring state machine. So far I've been very successful with it, but I'm stuck trying to model a dynamic bit, where:
the user is in state A
in that state he can create a short (predefined) task for a different user (a small state machine)
those users basically have to execute a state machine flow to the end
it should be possible to spawn many tasks concurrently
the user returns to state A once all the tasks he created have completed.
Here is a graphical representation of what I'm trying to achieve.
I think I could do this if I represent each task as a state machine and so on but I would prefer to avoid going that route as it would complicate the application. Ideally I would have just one state machine configuration.
In the Spring reference I found the fork pseudo state, which may be what I'm looking for; however, the official example repo only covers a static configuration (https://github.com/spring-projects/spring-statemachine/blob/master/docs/src/reference/asciidoc/sm-examples.adoc#statemachine-examples-tasks) where each task is already defined (T1, T2, T3). For my application, however, I would want to be able to add "T4" at runtime.
In essence I would like to know whether my requirements could be fulfilled with a single state machine and whether I could use fork() for my needs. If that's not the case, I welcome any advice that would push me in the right direction.
As I commented over the weekend, if you need a "dynamic" configuration then the easiest way to do it is with the "dynamic builder interfaces", which is the same approach used in all the other examples. It was basically added to make it possible to use SSM outside of a Spring application context. The tasks recipe uses this model, as it supports running a DAG of tasks using hierarchical regions and submachines.
You don't necessarily need a fork, because entering parallel regions via their initial states is equivalent. You do, however, need a join to wait for the parallel regions to finish their execution.
While that recipe provides some background on how things can be done, we hopefully have something better on our roadmap: a DSL that should make these kinds of custom implementations much easier to build.

Tasks vs. TPL Dataflow vs. Async/Await, which to use when?

I have read through quite a number of technical documents, by members of the Microsoft team and other authors, detailing the functionality of the new TPL Dataflow library, the async/await concurrency framework, and the TPL. However, I have not really come across anything that clearly delineates which to use when. I am aware that each has its own place and applicability, but specifically I wonder about the following situation:
I have a data flow model that runs completely in-process. At the top sits a data generation component (A) which generates data and passes it on either via data flow block linkages or through raising events to a processing component (B). Some parts within (B) have to run synchronously while (A) massively benefits from parallelism as most of the processes are I/O or CPU bound (reading binary data from disk, then deserializing and sorting them). In the end the processing component (B) passes on transformed results to (C) for further usage.
I wonder specifically when to use tasks, async/await, and TPL Dataflow blocks with regard to the following:
Kicking off the data generation component (A). I clearly do not want to block the GUI/dashboard, so this process would have to run on a different thread/task.
How to call methods within (A), (B), and (C) that are not directly involved in the data generation and processing but perform configuration work that may take several hundred milliseconds or seconds to return. My hunch is that this is where async/await shines?
What I struggle with most is how best to design the message passing from one component to the next. TPL Dataflow looks very interesting, but it is sometimes too slow for my purpose (see the note at the end regarding performance issues). If I don't use TPL Dataflow, how do I achieve responsiveness and concurrency with in-process, inter-task data passing? For example, if I raise an event within a task, the subscribed event handler runs on the same task instead of being passed to another task, correct? In summary, how can component (A) go about its business after passing data on to component (B), while component (B) receives the data and focuses on processing it? Which concurrency model is best used here?
I implemented data flow blocks here, but is that truly the best approach?
I guess the points above, in summary, reflect my struggle with how to design and implement API-style components using standard practice. Should methods be designed as async, data inputs as dataflow blocks, and data outputs as either dataflow blocks or events? What is the best approach in general? I am asking because most of the components mentioned above are supposed to work independently, so they can essentially be swapped out or altered internally without having to rewrite accessors and outputs.
Note on performance: I mentioned that TPL Dataflow blocks are sometimes slow. I deal with a high-throughput, low-latency type of application and target disk I/O limits, and thus TPL Dataflow blocks often performed much more slowly than, for example, a synchronous processing unit. The issue is that I do not know how to embed the process in its own task or concurrency model to achieve something similar to what TPL Dataflow blocks already take care of, but without the overhead that comes with them.
It sounds like you have a "push" system. Plain async code only handles "pull" scenarios.
Your choice is between TPL Dataflow and Rx. I think TPL Dataflow is easier to learn, but since you've already tried it and it won't work for your situation, I would try Rx.
Rx comes at the problem from a very different perspective: it is centered around "streams of events" rather than TPL Dataflow's "mesh of actors". Recent versions of Rx are very async-friendly, so you can use async delegates at several points in your Rx pipeline.
Regarding your API design, both TPL Dataflow and Rx provide interfaces you should implement: IReceivableSourceBlock/ITargetBlock for TPL Dataflow, and IObservable/IObserver for Rx. You can just wire up the implementations to the endpoints of your internal mesh (TPL Dataflow) or query (Rx). That way, your components are just a "block" or "observable/observer/subject" that can be composed in other "meshes" or "queries".
Finally, for your async construction system, you just need to use the factory pattern. Your implementation can call Task.Run to do configuration on a thread pool thread.
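The thread is .NET-specific, but the push-style hand-off it describes is language-agnostic. As a neutral sketch of the decoupling idea only (written in Python with invented component names, not a translation of TPL Dataflow or Rx), a producer can keep generating while a consumer drains a bounded queue:

# Push-style pipeline sketch: component A keeps producing while component B
# consumes from a bounded queue; the queue size provides back pressure.
import asyncio

async def component_a(queue):
    for i in range(10):
        item = i * i                  # stand-in for generated data
        await queue.put(item)         # push to B; awaits only if the queue is full
    await queue.put(None)             # sentinel: no more data

async def component_b(queue):
    while True:
        item = await queue.get()
        if item is None:
            break
        print("processed", item)      # stand-in for B's transformation

async def main():
    queue = asyncio.Queue(maxsize=100)
    await asyncio.gather(component_a(queue), component_b(queue))

asyncio.run(main())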
Just wanted to leave this here, in case it helps someone get a feeling for when to use Dataflow, because I was surprised at the TPL Dataflow performance. I had the following scenario:
Iterate through all the C# code files in a project (around 3500 files)
Read all the files' lines (an I/O operation)
Iterate through all the file lines and find certain strings in them
Return the files and the lines that contain the searched string
I thought this was a really good fit for TPL Dataflow, but when I simply spawned a new Task for each file I needed to open and did all the logic in that specific task, the code was faster.
From this my conclusion was to go with an async/await/Task implementation by default, at least for such simple tasks, and that TPL Dataflow was made for more complex situations, especially when you need batching and other more "push"-based scenarios and when synchronization is more of an issue.
Edit: I have since done some more research on this and created a demo project, and the results are quite interesting: as we have more operations and they become more complex, TPL Dataflow becomes more efficient.
Here is the link to the repo.

Analysis of messages sent (method invocations) within a ruby application?

Is there a tool that analyzes the messages that are sent to objects (i.e. method invocations) within a Ruby application?
Ideally the tool would create a (GraphViz) diagram and would be able to filter the classes in the results (e.g. monitor only the classes specific to the application instead of all classes like String, Array and the lot).
Unless you have dtrace support, rubyprof is the next best thing.
As for graphing, you may have to use an auxiliary analysis package of some sort to get the kinds of results you want.

How would you implement a Workflow system?

I need to implement a Workflow system.
For example, to export some data, I need to:
Use an XSLT processor to transform an XML file
Use the resulting transformation to convert into an arbitrary data structure
Use the result (file or data) to generate an archive
Move the archive into a given folder.
I started by creating two types of classes: a Workflow, which is responsible for adding new Step objects and running them.
Each Step implements a StepInterface.
My main concern is that all my steps depend on the previous one (except the first), and I'm wondering what would be the best way to handle this.
I thought of looping over the steps and providing each step with the result of the previous one (if any), but I'm not really happy with it.
Another idea would have been to allow a "previous" Step to be set on a Step, like:
$s = new Step();
$s->setPreviousStep($previousStep);
But I lose the utility of a Workflow class.
Any ideas or advice?
By the way, I'm also concerned about the success or failure of the whole workflow, meaning that if any step fails I need to roll back or clean up the previous data.
I implemented a similar workflow engine last year (closed source though, so no code that I can share). Here are a few ideas based on that experience:
StepInterface - can do what you're doing right now - abstract a single step.
Additionally, provide a rollback capability, but I think a step should know when it fails and clean up before proceeding further. An abstract step can handle this for you (template method).
You might want to consider branching based on the StepResult - so you could do a StepMatcher that takes a stepResult object and a conditional - its sub-steps are executed only if the conditional returns true.
You could also do a StepException to handle exceptional flows if a step errors out. Ideally, this is something that you can define either at a workflow level (do this if any step fails) and/or at a step level.
I took the approach that a step returns a well-defined structure (StepResult) that's available to the next step. If there's bulky data (say a large file, etc.), then a URI/locator for the resource is passed in the StepResult.
Your workflow is going to need a context to work with - in the example you quote, this would be the name of the file, the location of the archive and so on - so think of a WorkflowContext.
Additional thoughts
You might want to consider the following too - if this is something that you're planning to implement as a large scale service/server:
Steps could be in libraries that were dynamically loaded
Workflow definition in an XML/JSON file - again, dynamically reloaded when edited.
Remote invocation and call back - submit job to remote service with a callback API. when the remote service calls back, the workflow execution is picked up at the subsequent step in the flow.
Parallel execution where possible etc.
Stateless design
Rolling back can be fit into this structure easily, as each Step will implement its own rollback() method, which the workflow can call (in reverse order preferably) if any of the steps fail.
As for the main question, it really depends on how sophisticated you want to get. At a basic level, you can define a StepResult interface, which is returned by each step and passed on to the next one. The obvious problem with this approach is that each step has to "know" which implementation of StepResult to expect. For small systems this may be acceptable; for larger systems you'd probably need some kind of configurable mapping framework that can be told how to convert the result of the previous step into the input of the next one. So the Workflow calls a Step, the Step returns a StepResult, the Workflow then calls a StepResultConverter (which is your configurable mapping thingy), the StepResultConverter returns a StepInput, the Workflow then calls the next Step with that StepInput, and so on.
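A minimal sketch of that Workflow/Step/StepResult shape, including the reverse-order rollback mentioned above (written in Python rather than PHP purely for brevity; the class names follow the answers, everything else is invented):

class StepResult(object):
    """Well-defined structure handed from one step to the next."""
    def __init__(self, data=None):
        self.data = data

class Step(object):
    def run(self, previous_result):
        # Perform this step's work and return a StepResult for the next step.
        raise NotImplementedError
    def rollback(self):
        # Undo or clean up this step's effects; default is a no-op.
        pass

class Workflow(object):
    def __init__(self, steps):
        self.steps = steps
    def run(self, initial=None):
        completed = []
        result = StepResult(initial)
        try:
            for step in self.steps:
                result = step.run(result)
                completed.append(step)
            return result
        except Exception:
            # Roll back completed steps in reverse order, as suggested above.
            for step in reversed(completed):
                step.rollback()
            raise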
I've had great success implementing workflows using a finite state machine. It can be as simple or as complicated as you like, with multiple workflows linking to each other. Generally an FSM can be implemented as a simple table, where the current state of a given object is tracked in a history table by keeping a journal of the transitions on the object and simply retrieving the last entry. So a transition would be of the form:
nextState = TransLookup(currState, Event, [Condition])
If you are implementing a front end you can use this transition information to construct a list of the events available to a given object in its current state.
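A toy sketch of that table-driven idea (Python used only for illustration; the states, events, and journal shape are invented):

# Transition table: (current_state, event) -> (next_state, optional guard)
TRANSITIONS = {
    ("draft", "submit"):   ("review",   None),
    ("review", "approve"): ("approved", lambda ctx: ctx.get("signed_off", False)),
    ("review", "reject"):  ("draft",    None),
}

history = []  # journal of transitions; the current state is the last entry

def trans_lookup(curr_state, event, ctx=None):
    next_state, guard = TRANSITIONS[(curr_state, event)]
    if guard is not None and not guard(ctx or {}):
        return curr_state  # guard (condition) failed: stay in the current state
    history.append((curr_state, event, next_state))
    return next_state

def available_events(curr_state):
    # What a front end could use to list the events valid in the current state.
    return [event for (state, event) in TRANSITIONS if state == curr_state]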
