Multiple flows with NiFi - apache-nifi

We have multiple (50+) NiFi flows that all do basically the same thing: pull some data out of a DB, append some columns, convert to Parquet, and upload to HDFS. They differ only in details such as the SQL query to run or the location in HDFS where they land.
The question is how to factor out this common flow such that any change made to it automatically applies to all derived flows. E.g. if I want to add an extra step to also publish the data to Kafka, I want to make this change once and have it automatically apply to all 50 flows.
We’ve tried to get this working with NiFi Registry, however it seems like an imperfect fit. Essentially the issue is that NiFi Registry seems to work well for updating a flow in one environment (say UAT) and then automatically updating it in another environment (say prod). It seems less suited for updating multiple flows in the same environment, with one specific example being that it resets the name of each flow to the template name every time we redeploy, meaning that all flows end up with the same name!
Does anyone know how one is supposed to manage a situation like ours, as I guess it must be pretty common?

Apache NiFi has Process Groups. As the name suggests, a process group is there to group together a set of processors and their pipeline that perform a similar task.
So for your case, what you can do is refactor the flow by moving the common, reusable part into a separate process group with an input port. Connect each outside flow that depends on this reusable flow to the input port of the reusable process group. Depending on your requirements, you can create an output port in this process group as well and connect it to the outside flow.
Attaching a sample. For the sake of explanation I have made a mock flow, so ignore the processor types that are used and instead look at the names I have given those processors.
The first screenshots show that I read from two different sources and connect each of them to its own processor that applies the source-specific changes.
Then I connect these two flows to the input port of a process group that has the reusable flow inside. So ultimately the two different flows shown above get to work with a common reusable flow.
The next screenshot shows what's inside the reusable flow.
Finally, the output port (output to outside) connects the reusable flow to the outside component (Write to somewhere).
I hope this helps you with refactoring your complex flows. Feel free to get back, if you have any queries.
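On the "change once, apply to all 50" side: if the reusable process group is version-controlled in NiFi Registry, you can also script the discovery (and upgrade) of every group that imports it against the NiFi REST API. A rough sketch using plain HTTP calls (the URL is a placeholder, authentication is omitted, and the response field names should be double-checked against your NiFi version's /nifi-api documentation):

    import requests

    NIFI_API = "http://localhost:8080/nifi-api"  # assumption: adjust URL and add auth for your cluster

    # List all process groups directly under the root canvas
    root_flow = requests.get(f"{NIFI_API}/flow/process-groups/root").json()
    groups = root_flow["processGroupFlow"]["flow"]["processGroups"]

    for group in groups:
        pg_id = group["id"]
        name = group["component"]["name"]
        # Ask NiFi which registry flow (if any) this group is tracking
        version_entity = requests.get(f"{NIFI_API}/versions/process-groups/{pg_id}").json()
        info = version_entity.get("versionControlInformation")
        if info:
            print(f"{name}: registry flow {info['flowId']} at version {info['version']}")
        else:
            print(f"{name}: not under version control")

Once you know which groups track the shared flow, bumping them all to the latest registry version (via nipyapi's versioning helpers or the version update endpoints under /nifi-api/versions) becomes one loop instead of 50 manual changes.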

Related

With the CQRS pattern, how are you still limited to one service per database?

According to my understanding:
We should only have one service connecting to a database.
With CQRS you will be keeping two databases in sync, hypothetically using some “service” gluing them together.
Doesn’t that now mean there’s a service whose only purpose is to keep the two in sync, and another service to access the data?
Questions
Doesn’t that go against the rule above? Or does this pattern only apply when native replication is being used?
Also, other than being able to independently scale the replicated database for more frequent reads, does the process of keeping both in sync kind of take away from that? Either way we’re writing the same data to both in the end.
Ty!
We should only have one service connecting to a database
I would rephrase this to: each service should be accessible via that service's API, and all internals, like the database, should be completely hidden. Hence, there should be no (logical) database sharing between services.
With CQRS you will be keeping two databases in sync, hypothetically using some “service” gluing them together
CQRS is a pattern for splitting how a service talks to its data layer. A typical example would be something like separating reads and writes, as those are fundamentally different: e.g. you do writes as commands via a queue and reads as exports via some stream.
CQRS is just an access pattern; using it (or not using it) does nothing for synchronization. If you do need a service to keep two other ones in sync, then you should still use the services' APIs instead of going into the data layer directly. And CQRS could sit under those APIs to optimize data processing.
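To make that concrete, here is a minimal, in-memory sketch of the split (not tied to any framework; the names, the single-process queue, and the order example are made up purely for illustration):

    from collections import deque

    command_queue = deque()   # write side: commands waiting to be processed
    read_model = {}           # read side: denormalized view optimized for queries

    def handle_command(cmd):
        """Write path: accept a command and enqueue it."""
        command_queue.append(cmd)

    def project():
        """Sync step: apply pending commands to the read model."""
        while command_queue:
            cmd = command_queue.popleft()
            if cmd["type"] == "create_order":
                read_model[cmd["order_id"]] = {"status": "created", "items": cmd["items"]}

    def query_order(order_id):
        """Read path: never touches the write side, only the read model."""
        return read_model.get(order_id)

    handle_command({"type": "create_order", "order_id": "42", "items": ["book"]})
    project()   # in a real system this would run asynchronously
    print(query_order("42"))

The point is that the synchronization (the "project" step) lives inside the service that owns the data; other services still go through that service's API rather than reading either store directly.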
The text above might address your first question. As for the second one: keeping the database encapsulated within a service does allow that database (and service) to be scaled as needed. So if you are using replication for reads, that would be a reasonable solution (assuming you address async vs. sync replication).
As for "writing data on both ends", I am actually not sure what that means...

NiFi processor to route flows based on a changeable list of regexes

I am trying to use NiFi to act as a router for syslog, based on a list of regexes matched against the syslog body (NB: as this is just a proof of concept, I can change any part if needed).
The thought process is that via a separate system (for now, vi and a text file 😃) an admin can define a list of criteria (regex format for each seems sensible) which, if matched, would result in syslog messages being sent to a specific separate system. For example, all critical audit data (matched by the regex list) is sent to the audit system and all other data goes to the standard log store.
I know that this can be done with a RouteOnContent processor, but the properties are configured before the processor starts, and an admin would have to stop the processor every time they need to make an edit.
I would like to load the list of regexes in periodically (automatically) and have the processor properties be updated.
I don’t mind if this is done all natively in NiFi (that is preferable, for elegance and to save an external app being written) or via a REST API call driven by a Python script or something (or can NiFi send REST calls to itself?!).
I appreciate that a processor property cannot be updated while running, so it would have to be stopped to be updated, but that’s fine as the queue will buffer for the brief period. Maybe a check to see whether the file has changed could avoid outages for no reason, rather than updating periodically regardless; I can solve that problem later.
Thanks
Chris
I think the easiest solution would be to use ScanContent, a processor which specifies a dictionary file on disk which contains a list of search terms and monitors the file for changes, reloading in that event. The processor then applies the search terms to the content of incoming flowfiles and allows you to route them based on matches. While this processor doesn't support regular expressions as dictionary terms, you could make a slight modification to the code or use this as a baseline for a custom processor with those changes.
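For reference, the reload-on-change behaviour you would want (whether in a modified ScanContent or a custom processor) boils down to something like the following sketch. This is plain Python rather than NiFi code, with a made-up patterns file path, just to illustrate the mechanism:

    import os
    import re

    PATTERNS_FILE = "/etc/router/patterns.txt"   # hypothetical: one regex per line
    _patterns, _mtime = [], 0.0

    def load_patterns():
        """Re-read and recompile the regex list only when the file has changed."""
        global _patterns, _mtime
        mtime = os.path.getmtime(PATTERNS_FILE)
        if mtime != _mtime:
            with open(PATTERNS_FILE) as f:
                _patterns = [re.compile(line.strip()) for line in f if line.strip()]
            _mtime = mtime
        return _patterns

    def route(message: str) -> str:
        """Send matching syslog bodies to 'audit', everything else to 'default'."""
        return "audit" if any(p.search(message) for p in load_patterns()) else "default"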
If that doesn't work for you, there are a number of LookupService implementations which show how CSV, XML, property files, etc. can be monitored and read by the controller framework to provide an updated mapping of key/value pairs. These can also serve as a foundation for building a more complicated scan/match flow using the loaded terms/patterns.
Finally, if you have to rely on direct processor property updating, you can script this with the NiFi API calls to stop, update, and restart the processors so it can be done in near-real-time. To determine these APIs, visit the API documentation or execute the desired tasks via the UI in your browser and use the Developer Tools to capture the HTTP requests being made.
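A rough outline of that stop/update/start cycle using plain HTTP calls is below. The processor id, the property name, and the unauthenticated URL are placeholders; the revision object has to be re-fetched before every mutation, and the payload details should be checked against your NiFi version's REST documentation:

    import requests

    NIFI_API = "http://localhost:8080/nifi-api"   # assumption: no auth, for brevity
    PROC_ID = "your-processor-uuid"               # placeholder

    def get_processor():
        return requests.get(f"{NIFI_API}/processors/{PROC_ID}").json()

    def set_run_status(state):
        """Stop or start the processor ('STOPPED' / 'RUNNING')."""
        entity = get_processor()
        requests.put(
            f"{NIFI_API}/processors/{PROC_ID}/run-status",
            json={"revision": entity["revision"], "state": state},
        )

    def update_property(name, value):
        """Change a single processor property."""
        entity = get_processor()
        body = {
            "revision": entity["revision"],
            "component": {"id": PROC_ID, "config": {"properties": {name: value}}},
        }
        requests.put(f"{NIFI_API}/processors/{PROC_ID}", json=body)

    set_run_status("STOPPED")
    update_property("Some Regex Property", "ERROR|CRITICAL")   # hypothetical property name/value
    set_run_status("RUNNING")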

Transform an IoT application from monolithic to microservices

I have a system composed of 3 sensors (temperature, humidity, camera) attached to an Arduino, 1 cloud, and 1 mobile phone. I developed a monolithic IoT application that has different tasks that need to be executed in these three different locations (Arduino, cloud, mobile). All these sensors have common tasks, which are: data detection and data transfer (executed on the Arduino), data saving, data analysis and data notification (on the cloud), and data visualization (on the mobile).
The problem: I know that a microservice is independent and has its own database. How do I transform the application that I have into one using a microservice architecture? The first idea is representing each task as a microservice.
At first, I considered each task as a component and thought to represent each one as a microservice, but they are linked: the output of the previous task is the input of the next one, so I can't make it like this because they aren't independent. Another thing: the data collection microservice should be placed on the Arduino and the data should be sent to the cloud to be stored there in the database, so here we have a distant DB. For data collection, I have the same idea as you: since there are different things (sensors), there will be different microservices (temperature data collection, camera data collection, ...).
First let me clear up a confusion: when we say microservices are independent, then how can we design a microservice in which the output of the previous task is the input for the next one?
When we say microservice, it means it is independently deployable and manageable, but as in any system there are dependencies, so microservices also depend upon each other. You can read about reactive microservices.
So you can have microservices which depend on one another, but we want these dependencies to be minimal.
Now let's understand the benefits we want from adopting microservices (this will help answer your question):
Independently deployable components (which speed up deployment): as in any big application, there are components which are relatively independent of each other, so if I want to change something in one component I should be confident another will not be impacted. In a monolith, as everything is in one binary, the impact would be high.
Independently scalable: as different components require different scale, we can have different types of databases and machine requirements.
There are various other benefits, and also some overhead that a microservice architecture brings (can't go into detail here; read about these things online).
Now let's discuss the approach:
As data collection is independent of how and what kind of analysis happens on that data, I would have a DataCollectionService on the cloud (it collects data from all sensors; we can create different ones for different sensors if those data are completely independent).
DataAnalysis as a separate service (it doesn't need to know a thing about how data is collected, e.g. via MQTT, WebSocket, periodically, in batches, or whatever). This service needs data and will act upon it.
Notification Service.
DataSendClient on the Arduino: some client which sends data to the data collection service.
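To illustrate the boundary between the DataSendClient and the DataCollectionService, here is a minimal sketch. The endpoint URL and JSON shape are invented for illustration; on a real Arduino this would be C/C++ or MicroPython, and Python is used here only for readability:

    import time
    import requests

    COLLECTION_URL = "http://cloud.example.com/api/readings"   # hypothetical endpoint

    def send_reading(sensor: str, value: float):
        """The client only knows the collection service's API, nothing about storage or analysis."""
        requests.post(COLLECTION_URL, json={
            "sensor": sensor,
            "value": value,
            "ts": time.time(),
        })

    send_reading("temperature", 22.5)
    send_reading("humidity", 48.0)

The collection service owns its database; the analysis and notification services consume the data through that service's API (or through events it publishes), not by reading its database directly.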

Using NiFi for scheduling Hadoop batch processes

According to NiFi's homepage, it "supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic".
I've been playing with NiFi for the last couple of months and can't help wondering why not use it also for scheduling batch processes.
Let's say I have a use case in which data flows into Hadoop, is processed by a series of Hive/MapReduce jobs, and is then exported to some external NoSQL database to be used by some system.
Using NiFi to ingest and flow data into Hadoop is a use case that NiFi was built for.
However, using NiFi to schedule the jobs on Hadoop ("Oozie-like") is a use case I haven't encountered others implementing, and since it seems completely possible to implement, I'm trying to understand if there are reasons not to do so.
The gain of doing it all in NiFi is that one gets a visual representation of the entire course of data, from source to destination, in one place. In the case of complicated flows this is very important for maintenance.
In other words, my question is: are there reasons not to use NiFi as a scheduler/coordinator for batch processes? If so, what problems may arise in such a use case?
PS: I've read "Is Nifi having batch processing?", but my question aims at a different sense of "batch processing in NiFi" than the one raised in that question.
You are correct that you would have a concise visual representation of all steps in the process if the schedule triggers were present on the flow canvas, but NiFi was not designed as a scheduler/coordinator. There are existing comparisons of dedicated scheduler options that are worth reviewing.
Using NiFi to control scheduling feels like a "hammer" solution in search of a problem. It will reduce the ease of defining those schedules programmatically or interacting with them from external tools. Theoretically, you could define the schedule format, read the schedules into NiFi from a file, datasource, endpoint, etc., and use ExecuteStreamCommand, ExecuteScript, or InvokeHTTP processors to kick off the batch process. However, this feels like introducing an unnecessary intermediate step. If consolidation and visualization are what you are going for, you could have a monitoring flow segment ingest those schedule definitions from their native format (Oozie, XML, etc.) and display them in NiFi without making NiFi responsible for defining and executing the scheduling.
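As a concrete (if simplistic) illustration of the ExecuteStreamCommand route: NiFi would only be the trigger, and the actual batch step would live in a small wrapper script along these lines (the HQL path is a placeholder, and Oozie/YARN specifics are omitted):

    import subprocess
    import sys

    def run_hive_job(hql_path: str) -> int:
        """Launch a Hive batch job and surface its exit code back to NiFi."""
        result = subprocess.run(["hive", "-f", hql_path])
        return result.returncode

    if __name__ == "__main__":
        sys.exit(run_hive_job("/jobs/daily_aggregation.hql"))   # hypothetical script path

ExecuteStreamCommand surfaces the command's exit status, so the flow can branch on success vs. failure without NiFi itself owning the schedule semantics.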

ETL, Esper or Drools?

The question environment relates to Java EE and Spring.
I am developing a system which can start and stop arbitrary TCP (or other) listeners for incoming messages. There could be a need to authenticate these messages. These messages need to be parsed and stored in some other entities. These entities model which fields they store.
So, for example, if I have property1 that can have two text fields, FillLevel1 and FillLevel2, I could receive messages over TCP which have both fill levels specified in text as F1=100;F2=90.
Later I could add another field, say FillLevel3, when I start receiving messages like F1=xx;F2=xx;F3=xx. But this is a conscious decision on the part of the system modeler.
My question is: what do you think is better to use for parsing and storing the messages? One option is ETL (using Pentaho, which is used in another system), where you store the raw message and use a task executor to consume the messages one by one and store the transformed messages as per your rules.
One could instead use Esper or Drools to do the same thing, storing rules and executing them with a timer, but I am not sure how dynamic you can get with making rules (they have to be made by the end user in a running system, preferably in the most user-friendly way, i.e. no scripts or code, only a GUI).
The end user should be capable of changing the parse rules. It is also possible that the end user might want to change the archived data as well (for example, in the case above, if a new FillLevel field is added, one would like to put FillLevel=-99 in the previous values to make the data consistent).
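To make the parsing requirement concrete, this is roughly what the transformation needs to do regardless of which engine executes it (a plain Python illustration; the field mapping and the -99 default are just examples):

    # User-editable mapping from wire field names to entity fields,
    # e.g. maintained through a GUI and stored as configuration.
    FIELD_MAP = {"F1": "FillLevel1", "F2": "FillLevel2", "F3": "FillLevel3"}
    DEFAULT = -99   # value to backfill when an older message lacks a newer field

    def parse_message(raw: str) -> dict:
        """Turn 'F1=100;F2=90' into {'FillLevel1': 100, 'FillLevel2': 90, 'FillLevel3': -99}."""
        values = dict(pair.split("=", 1) for pair in raw.split(";") if pair)
        return {field: int(values.get(wire, DEFAULT)) for wire, field in FIELD_MAP.items()}

    print(parse_message("F1=100;F2=90"))

The real question is where this mapping lives (an ETL transformation step vs. a rule engine rule) and how a GUI lets the end user edit it.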
Please ask for explanations, I have the feeling that I need to revise this question a bit.
Thanks
Well, Esper is a great CEP engine, but Drools has its own CEP implementation, Drools Fusion, which integrates really well with jBPM. That would be a good choice.
