I am looking for a way to create a critical section in Apache NiFi. What I mean by that is to create a group of processors in which a single FlowFile would be processed exclusively - the next FlowFile would be picked up for processing from a queue when the previous FlowFile finishes with the last processor of the group. Please refer to the picture attached below.
Bottom line: at most one FlowFile should be processed within the critical section at a time. The concurrent tasks setting applies only to a single processor, not to a group of processors.
I want to implement access token management for the NiFi API. I would like to keep the token in a cache and also limit the number of requests to the NiFi API.
You can readily do this in NiFi by putting the processors of the "critical section" into a process group (PG) with an input and an output port. The PG can then be configured with a FlowFile concurrency of "Single FlowFile Per Node" so that it processes only one FlowFile at a time. If it needs to be a single FlowFile per cluster, you can use a load balancing strategy of "Single node" on a connection before entering the PG.
I'm assuming you are building a custom access token flow because the one already present in the NiFi API doesn't do what you want; if not, do check the documentation. It leverages the configured identity providers and gives you an access token that lasts for the configured duration (12 hours by default, I think).
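If you do end up caching the token yourself, here is a minimal sketch of the idea, assuming the standard POST /access/token endpoint and Python's requests library; the base URL and credentials are placeholders:

```python
import time

import requests

NIFI_BASE = "https://nifi.example.com:8443/nifi-api"  # placeholder base URL
TOKEN_TTL_SECONDS = 11 * 60 * 60  # refresh a bit before the default 12 h expiry

_cached_token = None
_fetched_at = 0.0

def get_nifi_token(username: str, password: str) -> str:
    """Return a cached NiFi access token, requesting a new one only when stale."""
    global _cached_token, _fetched_at
    if _cached_token is None or time.time() - _fetched_at > TOKEN_TTL_SECONDS:
        # POST /access/token takes form-encoded credentials and
        # returns the JWT as plain text in the response body.
        resp = requests.post(
            f"{NIFI_BASE}/access/token",
            data={"username": username, "password": password},
        )
        resp.raise_for_status()
        _cached_token = resp.text
        _fetched_at = time.time()
    return _cached_token

# Subsequent API calls then reuse the cached token as a bearer token:
# headers = {"Authorization": f"Bearer {get_nifi_token('user', 'pass')}"}
```

Caching like this also naturally limits the number of token requests to one per TTL window.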
Newbie Alert to Apache NiFi!
I'm curious to understand (and to read relevant material on) the scalability aspects of an Apache NiFi pipeline in a clustered setup.
Imagine there is a 2-node cluster: Node 1 and Node 2.
A simple use case as an example:
Query a database table in batches of 100 (let's say there are 10 batches).
For each batch, call a REST API (InvokeHTTP).
If a pipeline is triggered on Node 1 in a cluster, does this mean all 10 batches run only on Node 1?
Is there any out-of-the-box work distribution available in NiFi at the processor level, along the lines of 5 batches of REST API calls being executed per node?
Is the built-in queue of NiFi distributed in nature?
Or is the recommended way to scale at the processor level to publish the output of the previous processor to a messaging middleware (like Kafka) and have the subsequent NiFi processor consume from it?
What's the recommended way to scale at every processor level in NiFi?
Every queue (connection) has a load balancing strategy setting with the following options:
Do not load balance: Do not load balance FlowFiles between nodes in the cluster. This is the default.
Partition by attribute: Determines which node to send a given FlowFile to based on the value of a user-specified FlowFile Attribute.
Round robin: FlowFiles will be distributed to nodes in the cluster in a round-robin fashion.
Single node: All FlowFiles will be sent to a single node in the cluster.
In your example, setting Round robin on the connection feeding InvokeHTTP would spread the 10 batches across both nodes. Details in the documentation:
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Load_Balancing
Hi, I am new to NiFi, and I have followed the tutorial here to understand the provenance repository content and how to move it out for auditing. But I have a couple of questions:
The main use of provenance data is to understand what exactly happened to a piece of data. But here the data is in the FlowFile. How are we supposed to understand what happened to a particular piece of data using the FlowFile?
Is the best practice to always send provenance data from one NiFi to another? Why not use the SiteToSiteProvenanceReportingTask to send it to a port in the same NiFi instance and extract it from there?
What are the best tools for shipping this data out for auditing?
Hopefully this answers your questions:
You can export the provenance data in many ways. To extract the content of the FlowFile from the provenance event, I believe you have to get at the "content claims" for the FlowFile; I'm not sure how that works. Because content claims are reclaimed when no FlowFile in the current system is using them, I don't think you can query a provenance event's content once the content no longer exists in the content repository. Some components will add an attribute for any errors/status they encounter.
You can certainly use a SiteToSiteProvenanceReportingTask to send provenance data from a cluster back to itself; you probably just want to filter out the Input Port and Process Group that handle the processing of provenance data.
Data provenance is sometimes a graph problem, but the events are often useful on their own (without needing to know the flow, for example), so analysis can be done on the events themselves. I've sent the events to a Hive table and was then able to do things with HiveQL, like calculating predicted backpressure on connections (before we added it to NiFi proper).
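As a small illustration of analyzing the events on their own, here is a sketch that tallies a batch from the SiteToSiteProvenanceReportingTask by event type and component. The task emits batches as JSON arrays; the eventType and componentName field names are assumptions you should verify against your version's actual output:

```python
import json
from collections import Counter

def summarize_provenance(batch_json: str) -> Counter:
    """Count provenance events per (eventType, componentName) pair.

    Assumes the reporting task's JSON-array output; the field names are
    assumptions and should be checked against the actual events.
    """
    events = json.loads(batch_json)
    return Counter(
        (event.get("eventType"), event.get("componentName")) for event in events
    )

# Hypothetical two-event batch:
sample = (
    '[{"eventType": "SEND", "componentName": "PutHDFS"},'
    ' {"eventType": "DROP", "componentName": "PutHDFS"}]'
)
for (event_type, component), count in summarize_provenance(sample).items():
    print(f"{component} {event_type}: {count}")
```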
I have a requirement where we have a template that uses SQL as the source and SQL as the destination, and the data would be more than 100 GB for each table. The template will be instantiated multiple times based on the tables to be migrated, and each table is also partitioned into multiple FlowFiles. How do we know when the process is complete? Since there are multiple FlowFiles, we are unable to conclude anything when one hits the end processor.
I have tried using SiteToSiteStatusReportingTask to check the queue count, but it reports counts per connection, and it is difficult to fetch the connection ID for each connection and concatenate them, since we have a large number of templates. The reporting task has another problem: it reports on all process groups on the NiFi canvas, which is a huge amount of data when all templates are running and may impact performance, even though I used an Avro schema to fetch only the queue count and connection ID.
Can you please suggest some ideas and help me to achieve this?
You have multiple solutions:
1. You can use the Wait/Notify processor pair.
If you don't want multiple FlowFiles running in parallel:
2. Set backpressure on the queue.
3. Specify process-group-level FlowFile concurrency (recommended, but NiFi 1.12+ only).
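If what you ultimately need is a completion signal, one alternative to the reporting task is to poll the REST API for a process group's aggregate queued count and treat zero (after the source has finished) as done. A minimal sketch, assuming the /flow/process-groups/{id}/status endpoint and its aggregateSnapshot.flowFilesQueued field (verify both against your NiFi version):

```python
import time

import requests

NIFI_BASE = "https://nifi.example.com:8443/nifi-api"  # placeholder base URL

def queued_flowfiles(pg_id: str, token: str) -> int:
    """Return the aggregate number of FlowFiles queued inside one process group."""
    resp = requests.get(
        f"{NIFI_BASE}/flow/process-groups/{pg_id}/status",
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    status = resp.json()["processGroupStatus"]
    # Field names are assumptions; check the ProcessGroupStatus schema
    # for your NiFi version.
    return status["aggregateSnapshot"]["flowFilesQueued"]

def wait_until_drained(pg_id: str, token: str, poll_seconds: int = 30) -> None:
    """Block until the process group reports an empty queue."""
    while queued_flowfiles(pg_id, token) > 0:
        time.sleep(poll_seconds)
```

Because the status endpoint aggregates over the whole group, this avoids enumerating connection IDs per template.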
I have a clustered NiFi setup, and we are running the GetMongo processor with "Primary node" execution so that duplicate data is not fetched. This seems to be working fine. However, once I have this data, I want the subsequent processors in the chain to run across the cluster, i.e., the fetched data should be processed in parallel. Somehow this is not happening. So my questions are below, assuming GetMongo has fetched 30,000 records and they are in the queue:
1) How do I check whether a processor is running on a single node or on all nodes? The config has been set to all nodes, but when the processor is running, I see it display 1 in the top-right corner.
2) If one processor has been set to run only on the primary node, do all other processors in the flow also run only on the primary node?
Example:
In the screenshot above, my GetMongo is running on the primary node. How do I make sure that the ExecuteScript processor runs in parallel on all 3 NiFi nodes? As of now, if I check the status history of the ExecuteScript processor, I see data flowing only through the primary node.
Yes, that's correct. When you mark the source processor to run only on the primary node, all subsequent steps will happen on that node alone, since the data resides only on that node (the primary node), even when NiFi is clustered. To make it work the way you want, you can follow either of the following two approaches:
Approach #1: Combination of RPG and Site-to-Site
Here your flow will look like this:
Create an Input Port on the Root Group (the very top level of the NiFi canvas)
Make GetMongo run only on the Primary Node.
Connect the success relationship of the processor to a Remote Process Group (RPG). Configure the RPG with the cluster's own details and point it at the port you added in step #1.
From the input port, connect it to your processing logic.
Useful Links:
https://pierrevillard.com/2017/02/23/listfetch-pattern-and-remote-process-group-in-apache-nifi/
This is cumbersome and makes your flow complex, but this is how it had to be done until NiFi 1.8. With NiFi 1.8, you can use the following approach.
Approach #2 : Load-Balanced Connections (Apache NiFi 1.8+)
Apache NiFi had a new release, 1.8, a week ago. With this release, a new feature (a long-awaited and much-desired one) was introduced: Load-Balanced Connections.
In this approach, you can simply skip the RPG/Site-to-Site combination and instead do the following:
Connect the output of your source processor, in this case GetMongo, to the subsequent processors.
Right-click the success relationship of the source processor.
Click Configure.
Go to the Settings tab.
Set the Load Balance Strategy to the desired one, preferably Round robin in your case.
Useful Links:
https://blogs.apache.org/nifi/entry/load-balancing-across-the-cluster
https://pierrevillard.com/2018/10/29/nifi-1-8-revolutionizing-the-list-fetch-pattern-and-more/
I have a Kafka topic which includes different types of messages sent from different sources.
I would like to use the ExtractGrok processor to extract fields from the messages based on a regular expression / Grok pattern.
How do I configure or run the processor with multiple regular expressions?
For example, the Kafka topic contains INFO, WARNING and ERROR log entries from different applications.
I would like to separate the messages by log level and place them into HDFS.
Instead of using the ExtractGrok processor, use the PartitionRecord processor in NiFi, as this processor evaluates one or more RecordPaths against each record in the incoming FlowFile. Each record is then grouped with other "like" records.
Configure/enable the controller services:
Record Reader as GrokReader
Record Writer in your desired format
Then use the PutHDFS processor to store the FlowFile based on the loglevel attribute.
Flow:
1. ConsumeKafka processor
2. PartitionRecord processor
3. PutHDFS processor
Refer to this link, which describes all the steps for configuring the PartitionRecord processor.
Refer to this link, which describes how to store partitions dynamically in HDFS directories using the PutHDFS processor.
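To make the pieces concrete, here is a sketch of the relevant properties; the Grok pattern, the field names, and the directory layout are illustrative assumptions, not the only way to set this up:

```
# GrokReader (controller service), assuming log lines like:
#   2018-11-01 12:00:00 INFO order-service Order created
Grok Expression: %{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:loglevel} %{DATA:application} %{GREEDYDATA:message}

# PartitionRecord: one user-defined property; its name becomes an
# attribute on each outgoing FlowFile, its value is a RecordPath.
loglevel: /loglevel

# PutHDFS: route each partition to its own directory via Expression Language.
Directory: /data/logs/${loglevel}
```

With this, INFO, WARNING, and ERROR records end up in separate FlowFiles and separate HDFS directories.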