Apache NiFi - Can it scale at the processor level?

Newbie Alert to Apache NiFi!
I am curious to understand (and read relevant material on) the scalability aspects of an Apache NiFi pipeline in a clustered setup.
Imagine there is a 2-node cluster: Node 1 and Node 2.
A simple use case as an example:
Query a database table in batches of 100 (let's say there are 10 batches).
For each batch, call a REST API (InvokeHTTP).
If a pipeline is triggered on Node 1 in the cluster, does this mean all 10 batches run only on Node 1?
Is there any "out-of-the-box" work distribution available in NiFi at every processor level? Along the lines of: 5 batches of REST API calls are executed per node.
Is the built-in queue of NiFi distributed in nature?
Or is the recommended way to scale at the processor level to publish the output of the previous processor to a messaging middleware (like Kafka) and have the subsequent NiFi processor consume from it?
What's the recommended way to scale at every processor level in NiFi?

Every queue (connection) has a Load Balance Strategy setting with the following options:
Do not load balance: Do not load balance FlowFiles between nodes in the cluster. This is the default.
Partition by attribute: Determines which node to send a given FlowFile to based on the value of a user-specified FlowFile Attribute.
Round robin: FlowFiles will be distributed to nodes in the cluster in a round-robin fashion.
Single node: All FlowFiles will be sent to a single node in the cluster.
Details in the documentation:
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Load_Balancing
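The same change can also be made programmatically: the UI is just a front end for the connections endpoint of the NiFi REST API. Below is a minimal sketch, assuming an unsecured NiFi instance on localhost and a hypothetical connection id (on a secured cluster you would add a Bearer token header):

```python
import requests

NIFI_API = "http://localhost:8080/nifi-api"             # assumed unsecured dev instance
CONNECTION_ID = "a1b2c3d4-0123-1000-0000-000000000000"  # hypothetical connection id

def set_round_robin(connection_id: str) -> None:
    # Fetch the current entity first: NiFi uses the revision for
    # optimistic locking, so updates must echo the latest revision back.
    entity = requests.get(f"{NIFI_API}/connections/{connection_id}").json()
    payload = {
        "revision": entity["revision"],
        "component": {
            "id": connection_id,
            # Other values: DO_NOT_LOAD_BALANCE, PARTITION_BY_ATTRIBUTE, SINGLE_NODE
            "loadBalanceStrategy": "ROUND_ROBIN",
        },
    }
    requests.put(f"{NIFI_API}/connections/{connection_id}", json=payload).raise_for_status()

set_round_robin(CONNECTION_ID)
```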

Related

Critical section in NiFi. Access Token management

I am looking for a way to create a critical section in Apache NiFi. What I mean by that is a group of processors in which a single FlowFile is processed exclusively: the next FlowFile is picked up from the queue only when the previous FlowFile finishes with the last processor of the group.
Bottom line: at most one FlowFile should be processed within the critical section at a time. The concurrent tasks setting applies only to a single processor, not to a group of processors.
I want to implement access token management to NiFi API. I would like to keep the token in a cache and also to limit the number of requests to NiFi API.
You can readily do this in NiFi by putting the processors of the "critical section" into a process group (PG) with an input and an output port. The PG can then be configured with a FlowFile concurrency of "Single FlowFile Per Node" to make it process only one FlowFile at a time. If it needs to be a single FlowFile per cluster, you can use a load balancing strategy of "Single node" on a connection before entering the PG.
I'm assuming you are building a custom access token flow because the one already present in the NiFi API doesn't do what you want; if not, do check the documentation. It leverages the configured identity providers and gives you an access token that lasts for the configured duration (12 hours by default, I think).
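For reference, the built-in endpoint is POST /access/token with form-encoded credentials, which returns a JWT valid for the configured duration. A minimal token-caching sketch (the URL, credentials, and refresh margin are illustrative assumptions):

```python
import time
import requests

NIFI_API = "https://localhost:8443/nifi-api"      # assumed secured NiFi instance
USERNAME, PASSWORD = "admin", "admin-password"    # hypothetical credentials

_cached_token = None
_token_expiry = 0.0

def get_token() -> str:
    """Return a cached NiFi access token, fetching a new one when it nears expiry."""
    global _cached_token, _token_expiry
    if _cached_token is None or time.time() >= _token_expiry:
        resp = requests.post(
            f"{NIFI_API}/access/token",
            data={"username": USERNAME, "password": PASSWORD},
            verify=False,  # only for a self-signed dev cert; verify TLS in production
        )
        resp.raise_for_status()
        _cached_token = resp.text  # body is the raw JWT string
        # Refresh somewhat before the 12-hour default expiration.
        _token_expiry = time.time() + 11 * 3600
    return _cached_token

# Use the token as a Bearer header on subsequent API calls:
headers = {"Authorization": f"Bearer {get_token()}"}
```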

Distributing tasks with equal load in a multinode cluster

A Spring scheduler is triggered every hour, and it fires on all deployed nodes.
The scheduler reads a list of data from the DB and processes each item in the list.
The processing of the data must not be duplicated across nodes.
Each node should be able to uniquely identify the items in the list that it should process.
Is there any microservice pattern/distributed architecture pattern to achieve this using a distributed cache?
Please note:
We would not be able to acquire a lock on the DB.
Each item in the list has a unique id.
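This is not NiFi-specific, but the usual pattern is an atomic per-item claim in the shared cache: every node reads the same list, and only the node that wins the claim for an item processes it. A minimal sketch with Redis (the host, key layout, and process() are illustrative assumptions):

```python
import time
import redis

cache = redis.Redis(host="redis-host", port=6379)  # assumed shared cache

def run_scheduled_job(items):
    """Called on every node by the hourly trigger with the same DB list."""
    run_bucket = int(time.time() // 3600)  # identifies the current hourly run
    for item in items:
        # SET ... NX is atomic: exactly one node wins the claim per item per run.
        key = f"claim:{run_bucket}:{item['id']}"
        if cache.set(key, "claimed", nx=True, ex=2 * 3600):
            process(item)  # hypothetical per-item processing function
```

An alternative that avoids the cache round-trips is deterministic partitioning: each node processes only the items where hash(id) % nodeCount == nodeIndex, at the cost of maintaining stable node indices.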

How do we know when a flow is completed when multiple FlowFiles are running in parallel?

I have a requirement with a template that uses SQL as both source and destination, with more than 100 GB of data per table. The template is instantiated multiple times, once per table to be migrated, and each table is partitioned into multiple FlowFiles. How do we know when the process is completed? Since there are multiple FlowFiles, we cannot conclude the flow is done when one of them hits the end processor.
I have tried using SiteToSiteStatusReportingTask to check the queue count, but it reports counts per connection, and it is difficult to fetch the connection id for each connection and concatenate them, since we have a large number of templates. There is another problem with the reporting task: it reports on all process groups on the NiFi canvas, which is a huge amount of data when all templates are running and may impact performance, even though I used an Avro schema to fetch only the queue count and connection id.
Can you please suggest some ideas and help me to achieve this?
You have multiple solutions:
1 - You can use the Wait/Notify processor pair.
If you don't want multiple FlowFiles running in parallel:
2 - Set backpressure on the queue.
3 - Specify group-level FlowFile concurrency (recommended, but NiFi 1.12+ only).
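On the queue-count pain specifically: instead of pulling every connection through the reporting task, you can ask each template's process group for its aggregate status, which rolls the queued count of all its connections into a single number. A minimal polling sketch (the unsecured instance and group id are assumptions; note that an empty queue only means "done" once the source processor has finished emitting):

```python
import time
import requests

NIFI_API = "http://localhost:8080/nifi-api"  # assumed unsecured dev instance
PG_ID = "pg-id-of-one-template"              # hypothetical process group id

def wait_until_drained(pg_id: str, poll_seconds: int = 30) -> None:
    """Poll the group's aggregate status until no FlowFiles remain queued."""
    while True:
        status = requests.get(f"{NIFI_API}/flow/process-groups/{pg_id}/status").json()
        queued = status["processGroupStatus"]["aggregateSnapshot"]["flowFilesQueued"]
        if queued == 0:
            return
        time.sleep(poll_seconds)

wait_until_drained(PG_ID)
```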

Distributing data read from GetMongo in a nifi cluster

I have a clustered NiFi setup and we run the GetMongo processor on the primary node only, so that duplicate data is not fetched. This seems to be working fine. However, once this data is fetched, I want the subsequent processors in the chain to run across the cluster, i.e., the fetched data should be processed in parallel. Somehow this is not happening. So my questions, assuming GetMongo has fetched 30000 records and they are in the queue:
1) How do I check whether a processor is running its process on a single node or on all nodes? The config has been set to all nodes, but when the processor is running I see it displays 1 in the top right corner.
2) If one processor has been set to run only on the primary node, do all other processors in the flow also run only on the primary node?
Example:
My GetMongo is running on the primary node; how do I make sure that the ExecuteScript processor runs in parallel on all 3 NiFi nodes? As of now, if I check View Status History on the ExecuteScript processor, I see data flowing only through the primary node.
Yes, that's correct. When you mark the source processor to run only on the Primary Node, all the subsequent steps will happen only on that node, since the data resides only on that node (the primary node), even when NiFi is in clustered mode. To make it work the way you want, you can follow either of the following two approaches:
Approach #1: Combination of RPG and Site-to-Site
Here your flow will look like this:
Create an Input Port on the Root Group (the very top level of the NiFi canvas)
Make GetMongo run only on Primary Node.
Connect the success relationship of the processor to a Remote Process Group (RPG). The RPG can be configured with the cluster details itself; configure it to connect to the port you added in step #1.
From the input port, connect it to your processing logic.
Useful Links:
https://pierrevillard.com/2017/02/23/listfetch-pattern-and-remote-process-group-in-apache-nifi/
This is cumbersome and makes your flow very complex, but this is how it had to be done until NiFi 1.8. With NiFi 1.8, you can use the following approach.
Approach #2 : Load-Balanced Connections (Apache NiFi 1.8+)
Apache NiFi had a new release, 1.8, a week ago. With this release, a new feature (a long time coming and very much desired one) was introduced: Load-Balanced Connections.
In this approach, you can simply ignore the RPG/Site-To-Site combination and rather do the following:
Connect the output of your source processor, in this case GetMongo, to the subsequent processors.
Right click the success relationship of the source processor.
Click Configure.
Go to the Settings tab.
Set the Load Balance Strategy to the desired one, preferably Round robin in your case.
Useful Links:
https://blogs.apache.org/nifi/entry/load-balancing-across-the-cluster
https://pierrevillard.com/2018/10/29/nifi-1-8-revolutionizing-the-list-fetch-pattern-and-more/

How to limit a NiFi processor to run on a single node in a cluster?

We are building a data workflow with NiFi and want the final (custom) processor (which runs the deduplication logic) to run on only one of the NiFi cluster nodes (instead of running on all of them). I see that NiFi 1.7.0 (which is not yet released) has a @PrimaryNodeOnly annotation to enforce single-node execution behaviour. Is there a way or workaround to enforce such behaviour in NiFi 1.6.0?
NOTE: In addition to @PrimaryNodeOnly, it would be better if NiFi provided a way to run a processor on a single node only (i.e., some annotation like @SingleNodeOnly). This way the execution node would not necessarily have to be the primary node, which would reduce the load on the primary node. This is just an ask for the future and not necessary to solve the problem mentioned above.
There is no specific workaround to enforce it in previous versions; it is on the data flow designer to mark the intended processor(s) to run on the Primary Node only. You could write a script to query the NiFi API for processors of certain types or names, then check/set the execution strategy to Primary Node Only, as sketched below.
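A minimal sketch of such a script (assuming an unsecured instance; the search term is hypothetical, and executionNode is the config field behind the UI's Execution setting, with values ALL or PRIMARY):

```python
import requests

NIFI_API = "http://localhost:8080/nifi-api"  # assumed unsecured dev instance

def force_primary_node(name_query: str) -> None:
    """Find processors matching a name and pin their execution to the primary node."""
    results = requests.get(f"{NIFI_API}/flow/search-results",
                           params={"q": name_query}).json()
    for hit in results["searchResultsDTO"]["processorResults"]:
        proc_id = hit["id"]
        entity = requests.get(f"{NIFI_API}/processors/{proc_id}").json()
        if entity["component"]["config"].get("executionNode") == "PRIMARY":
            continue  # already pinned to the primary node
        payload = {
            "revision": entity["revision"],  # echo revision for optimistic locking
            "component": {
                "id": proc_id,
                "config": {"executionNode": "PRIMARY"},
            },
        }
        requests.put(f"{NIFI_API}/processors/{proc_id}", json=payload).raise_for_status()

force_primary_node("Deduplicate")  # hypothetical processor name
```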
In NiFi 1.6.0 it's possible through the processor's configuration dialog: on the Scheduling tab, set Execution to "Primary node".
