How can NiFi handle burst data? - apache-nifi

If the submitted data to NiFi are not coming in a steady flow (but on bursty) how can NiFi handle them? Does it use a message broker to buffer them? I haven't seen anything like this in its documentation.

NiFi connections (the links between processors) have the capability of buffering FlowFiles (the unit of data that NiFi handles, basically content + metadata about that content), and NiFi also has the feature of backpressure, the ability of a processor to "tell" the upstream flow that it cannot handle any more data at a particular time. The relevant discussion from User Guide is here.
Basically you can set up connection(s) to be as "wide" as you expect the burst to be, or if that is not prudent, you can set it to a more appropriate value, and NiFi will do sort of a "leaky bucket with notification" approach, where it will handle what data it can, and the framework will handle scheduling of upstream processors based on whether they would be able to do their job.
If you are getting data from a source system that does not buffer data, then you can suffer data loss when backpressure is applied; however that is because the source system must push data when NiFi can not prudently accept it, and thus the alterations should be made on the source system vs the NiFi flow.

Related

Find Provenance Data For Flowfile Within a Processor

I am attempting to develop a NiFi processor that would extend the functionality of the built-in processor "Monitor Activity".
The problem I am attempting to solve is that in my application, I would have multiple flows entering the processor, with the processor alerting by email when no flowfiles arrive within a certain time period. However, if only one of the flows stop, no alert will be triggered.
I would like to modify the processor such that it would be able to distinguish between the different flows and alert accordingly.
In order to do this, I would need a way to deferentiate between flowfiles originating from one processor and another.
I am aware NiFi keeps detailed provenance records that can be easily accessed from within the GUI interface but I'm unable to find an easy way of accessing this information programmatically from within processor code.

NiFi - Use Wait/Notify for triggering GetFTP

I'm writing two dataflows, one is a webservice with HandleHttpRequest/Response processors, that after receiving a notification should trigger a separate flow with GetFTP to get files from an FTP directory.
I've tried to sync both using Wait/Notify processors, but GetFTP doesn't allow incoming connections so I cannot connect a Wait proc to it.
Any idea about how I can to do this?
FetchFTP can be used in this case, as it is designed to be used in conjunction with ListFTP.
This is a common pattern in Apache NiFi -- there will be a GetX processor, and then there will be ListX and FetchX processors which are used in tandem. ListX scans the source directory/listing/etc. and generates a flowfile for each matching result, and sends them to FetchX to retrieve each item individually.
If you already know the relevant values (i.e. file names), you can provide those to the FetchFTP processor. If not, you'll be in the same position you are in now, because ListFTP is also a source processor and thus does not accept incoming connections. You could technically use the Wait/Notify processors to trigger a REST API invocation to start/stop the GetFTP processor (see Apache NiFi REST API -- PUT /processors/{id}), but this is admittedly hacky.

Method for collecting user input within NiFi topology

I'm new to NiFi, so I'm not sure if this is possible (or a correct design approach):
I'm trying to create a processing pipeline where NiFi fetches a file (potentially multiple gigabytes in size) and performs some processing -- pretty straightforward until ...
A small portion of the resultant data needs to be shown to an end-user. The user would provide an input, and then NiFi will need to perform follow-on processing of the original data set.
I was going to use a Wait processor, prompt the user (via a GUI) using a PostHTTP processor, and then use a Notify processor. The flow looks like this:
FetchFile ==> Initial Processing ==> Wait
Prompt User for input (via PostHTTP processor)
Receive User Input (via ListenHTTP processor)
Notify ==> Follow-on Processing
Unfortunately, the PostHTTP processor posts ALL of the FlowFile's content to the specified endpoint ... which, again, could be multiple gigabytes in size ... which is prohibitive.
Is there a "standalone" NiFi processor (which doesn't include the FlowFile content) for this sort of User interaction? Is there another implementation strategy for this use-case? Or is this not even a correct application of NiFi?
Thank you!
A possible approach would be to write the file out to a temporary store (local disk, HDFS, S3, etc.), write a request for user action, and essentially exit NiFi completely. The flow would then be restarted after the user has taken action. To trigger the restart, you could have the UI post to NiFi, write to a message queue, or write another file to a directory NiFi is polling.
I think this would be more flexible for managing the user interaction, since you wouldn't have to complete the whole UI response in an HTTP post cycle.

Best practices to handle errors in NIFI

I'm using NIFI, and i have data flows where I use the following processos :
ExecuteScript
RouteOnAttribute
FetchMapDistribuedCache
InvokeHTTPRequest
EvaluateJSONPath
and two level process group like NIFI FLOW >>> Process group 1 >>> Process group 2, my question is how to handle errors in this case, I have created output port for each processor to output errors outside the process group and in the NIFI Flow I have done a funnel for each error type and then put all those errors catched in Hbase so i can do some reporting later on, and as you can imagine this add multiples relationships and my simple dataflow start to became less visible.
My questions are, what's the best practices to handle errors in processors, and what's the best approach to do some error reporting using NIFI ( Email or PDF )
It depends on the errors you routinely encounter. Some processors may fail to perform a task (an expected but not desired outcome), and route the failed flowfile to REL_FAILURE, a specific relationship which can be connected to a processor to handle these failures, or back to the same processor to be retried. Others (or the same processors in different scenarios) may encounter exceptions, which are unexpected occurrences which cannot be resolved by the processor.
An example of this is PutKafka vs. EncryptContent. If the remote Kafka system is temporarily unavailable, the processor would fail to send the flowfile content. However, retrying after some delay period could be successful if the remote system is once again available. However, decrypting cipher text with the wrong key will always throw an exception, no matter how many times it is attempted or how long the retry delay is.
Many users route the errors to PutEmail processor and report them to a specific user/group who can evaluate the errors and monitor the data flow if necessary. You can also use "Reporting Tasks" to monitor metrics or ingest provenance data as operational data and route that to email/offline storage, etc. to run analytics on it.

What is the purpose of data provenance in Apache NiFi Processors

For every processor there is a way to configure the processor and there is a context menu to view data provenance.
Is there a good explanation of what is data provenance?
Data provenance is all about understanding the origin and attribution of data. In a typical system you get 'logs'. When you consider data flowing through a series of processes and queues you end up with a lot of lots of course. If you want to follow the path a given piece of data took, or how long it took to take that path, or what happened to an object that got split up into different objects and so on all of that is really time consuming and tough. The provenance that NiFi supports is like logging on steroids and is all about keeping and tracking these relationships between data and the events that shaped and impacted what happened to it. NiFi is keeping track of where each piece of data comes from, what it learned about the data, maintains the trail across splits, joins, transformations, where it sends it, and ultimately when it drops the data. Think of it like a chain of custody for data.
This is really valuable for a few reasons. First, understanding and debugging. Having this provenance capture means from a given event you can go forwards or backwards in the flow to see where data came from and went. Given that NiFi also has an immutable versioned content store under the covers you can also use this to click directly to the content at each stage of the flow. You can also replay the content and context of a given event against the latest flow. This in turn means much faster iteration to the configuration and results you want. This provenance model is also valuable for compliance reasons. You can prove whether you sent data to the correct systems or not. If you learn that you didn't then have data with which you can address the issue or create a powerful audit trail for follow-up.
The provenance model in Apache NiFi is really powerful and it is being extended to the Apache MiNiFi which is a subproject of Apache NiFi as well. More systems producing more provenance will mean you have a far stronger ability to track data from end-to-end. Of course this becomes even more powerful when it can be combined with other lineage systems or centralized lineage stores. Apache Atlas may be a great system to integrate with for this to bring a centralized view. NiFi is able to not only do what I described above but to also send these events to such a central store. So, exciting times ahead for this.
Hope that helps.

Resources