Apache NiFi instance hangs on the "Computing FlowFile lineage..." window

My Apache NiFi instance just hangs on the "Computing FlowFile lineage..." window for a specific flow. Other flows work, but it won't show the lineage for any data files in this specific flow. The only error message in the log relates to an error in one of the processors, but I can't see how that would affect the lineage or stop the page from loading.

This turned out to be related to two things...
1) I was using the older (but default) provenance repository, which didn't perform well, resulting in the lag in the UI. So I needed to change the implementation in nifi.properties (a restart is required for the change to take effect)...
#nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository
nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
2) Fixing #1 exposed the second issue: the EnforceOrder processor was generating hundreds of provenance events per file, because I was ordering on a timestamp with large gaps between the values. This is apparently not a proper use case for EnforceOrder, so I'll have to remove it and find another way to do the ordering.

Related

Find Provenance Data For Flowfile Within a Processor

I am attempting to develop a NiFi processor that would extend the functionality of the built-in processor "Monitor Activity".
The problem I am attempting to solve is that in my application, multiple flows enter the processor, and the processor alerts by email when no flowfiles arrive within a certain time period. However, if only one of the flows stops, no alert is triggered.
I would like to modify the processor such that it would be able to distinguish between the different flows and alert accordingly.
In order to do this, I would need a way to differentiate between flowfiles originating from one processor and another.
I am aware NiFi keeps detailed provenance records that can be easily accessed from within the GUI, but I'm unable to find an easy way of accessing this information programmatically from within processor code.
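One caveat worth noting: the processor API only exposes the write side of provenance (session.getProvenanceReporter()), so reading events from inside a processor isn't directly supported. A ReportingTask, however, does get read access to the provenance repository through its ReportingContext. A minimal sketch, assuming the standard nifi-api dependencies (the class name and cursor handling here are illustrative):

```java
import java.io.IOException;
import java.util.List;

import org.apache.nifi.provenance.ProvenanceEventRecord;
import org.apache.nifi.provenance.ProvenanceEventRepository;
import org.apache.nifi.reporting.AbstractReportingTask;
import org.apache.nifi.reporting.ReportingContext;

// Illustrative: a reporting task that pages through provenance events.
// A real implementation would persist lastEventId via the state manager
// so the cursor survives a NiFi restart.
public class FlowActivityReportingTask extends AbstractReportingTask {

    private long lastEventId = 0L; // naive in-memory cursor

    @Override
    public void onTrigger(final ReportingContext context) {
        final ProvenanceEventRepository repo =
                context.getEventAccess().getProvenanceRepository();
        try {
            // Fetch up to 1000 events starting at the cursor
            final List<ProvenanceEventRecord> events = repo.getEvents(lastEventId, 1000);
            for (final ProvenanceEventRecord event : events) {
                // getComponentId() identifies the processor that emitted the
                // event, which is what lets you tell the flows apart
                getLogger().info("Event {} from component {}",
                        new Object[]{event.getEventType(), event.getComponentId()});
                lastEventId = event.getEventId() + 1;
            }
        } catch (final IOException e) {
            getLogger().error("Failed to read provenance events", e);
        }
    }
}
```

Tracking a last-seen timestamp per component id from these events would give the per-flow distinction that the built-in MonitorActivity lacks.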

Apache nifi: Difference between the flowfile State and StateManagement

From what I've read here and there, the flowfile repository serves as a Write Ahead Log for Apache NiFi.
While going through the configuration files, I've seen that there is a state-management configuration section. In standalone mode, a local-provider is used which writes the state (by default) to ./state/local/.
It seems like the flowfile repo and the state are both used, for example, to recover from a system failure.
Would someone please explain the difference between them? Do they work together?
Also, it's a best practice to have the flowfile repo and the content repo on two separate disks. What about the local state? Should we avoid using the "boot" disk and offload it to another one? Which one: a dedicated disk, or co-located with another (I'm co-locating the database and flowfile repos)?
Thanks.
The flow file repository keeps track of all the flow files in the system, which content they point to, which attributes they have, and where they are in the flow.
State Management is an API provided to processors/services that can be used to store and retrieve key/value pairs, typically for remembering where something left off. For example, a source processor that pulls data since some timestamp would want to store the last timestamp it used so that if NiFi restarts it can retrieve this value and start from there again.
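A minimal sketch of what that looks like from processor code (the key name and scope here are illustrative; the StateManager comes from context.getStateManager()):

```java
import java.io.IOException;
import java.util.Collections;

import org.apache.nifi.components.state.Scope;
import org.apache.nifi.components.state.StateManager;
import org.apache.nifi.components.state.StateMap;

// Illustrative: how a source processor remembers "where it left off".
public class StateExample {

    // Typically called from onTrigger(), passing context.getStateManager()
    static void rememberLastTimestamp(final StateManager stateManager,
                                      final long lastTimestamp) throws IOException {
        // Read the previously stored value, e.g. to resume after a restart
        // (the key is absent on the first run)
        final StateMap oldState = stateManager.getState(Scope.CLUSTER);
        final String previous = oldState.get("last.timestamp"); // may be null

        // Store the new value. Scope.CLUSTER shares it across all nodes;
        // Scope.LOCAL keeps it per node (backed by ./state/local when standalone).
        stateManager.setState(
                Collections.singletonMap("last.timestamp", String.valueOf(lastTimestamp)),
                Scope.CLUSTER);
    }
}
```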

Nifi processor to route flows based on changeable list of regex

I am trying to use NiFi as a router for syslog, based on a list of regexes matched against syslog.body (NB: as this is just a proof of concept, I can change any part if needed).
The thought process is that via a separate system (for now, vi and a text file 😃) an admin can define a list of criteria (regex format for each seems sensible) which, if matched, would result in syslog messages being sent to a specific separate system (for example, all critical audit data matched by the regex list is sent to the audit system, and all other data goes to the standard log store).
I know this can be done with the route-by-content processors (e.g. RouteOnContent), but their properties are configured before the processor starts, and an admin would have to stop the processor every time they need to make an edit.
I would like to load the list of regexes in periodically (and automatically) and have the processor properties updated.
I don't mind whether this is all done natively in NiFi (that is preferable, for elegance and to save writing an external app) or via a REST API call driven by a Python script or something (or can NiFi send REST calls to itself?!).
I appreciate that a processor property cannot be updated while running, so the processor would have to be stopped to be updated, but that's fine, as the queue will buffer for the brief period. A check to see whether the file has changed (rather than updating periodically regardless) could avoid needless outages; I can solve that problem later.
Thanks
Chris
I think the easiest solution would be to use ScanContent, a processor which reads a dictionary file on disk containing a list of search terms, monitors the file for changes, and reloads it when a change occurs. The processor then applies the search terms to the content of incoming flowfiles and allows you to route them based on matches. While this processor doesn't support regular expressions as dictionary terms, you could make a slight modification to the code, or use it as a baseline for a custom processor with those changes.
If that doesn't work for you, there are a number of LookupService implementations which show how CSV, XML, property files, etc. can be monitored and read by the controller framework to provide an updated mapping of key/value pairs. These can also serve as a foundation for building a more complicated scan/match flow using the loaded terms/patterns.
Finally, if you have to rely on direct processor property updates, you can script this with NiFi REST API calls to stop, update, and restart the processors, so it can be done in near-real-time. To determine these APIs, visit the API documentation, or execute the desired tasks via the UI in your browser and use the Developer Tools to capture the HTTP requests being made.
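As a rough sketch of that scripted approach (the endpoint paths follow the NiFi 1.x REST API, but the processor id, revision versions, and property name are placeholders; verify against your instance's /nifi-api documentation):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative: stop a processor, update one property, and restart it.
// Assumes an unsecured NiFi at localhost:8080 and a known processor id;
// a real client must read the current revision from GET /processors/{id}
// and echo it back, or the updates will be rejected.
public class UpdateRegexProcessor {
    private static final String BASE = "http://localhost:8080/nifi-api/processors/";
    private static final String PROCESSOR_ID = "<processor-uuid>"; // placeholder

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // 1) Stop the processor (revision version must match the current one)
        send(client, BASE + PROCESSOR_ID + "/run-status",
             "{\"revision\":{\"version\":1},\"state\":\"STOPPED\"}");

        // 2) Update the property holding the regex list ("Match Criteria"
        //    is a hypothetical property name for this example)
        send(client, BASE + PROCESSOR_ID,
             "{\"revision\":{\"version\":2},\"component\":{\"id\":\"" + PROCESSOR_ID
             + "\",\"config\":{\"properties\":{\"Match Criteria\":\"new-regex-here\"}}}}");

        // 3) Start it again
        send(client, BASE + PROCESSOR_ID + "/run-status",
             "{\"revision\":{\"version\":3},\"state\":\"RUNNING\"}");
    }

    private static void send(HttpClient client, String url, String json) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```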

Best practices to handle errors in NIFI

I'm using NiFi, and I have data flows where I use the following processors:
ExecuteScript
RouteOnAttribute
FetchDistributedMapCache
InvokeHTTP
EvaluateJsonPath
and two levels of process groups, like NiFi Flow >>> Process Group 1 >>> Process Group 2. My question is how to handle errors in this case. I have created an output port for each processor to send errors outside the process group, and in the NiFi flow I have added a funnel for each error type, then put all the errors caught into HBase so I can do some reporting later on. As you can imagine, this adds multiple relationships, and my simple dataflow starts to become less readable.
My questions are: what are the best practices for handling errors in processors, and what is the best approach to error reporting using NiFi (email or PDF)?
It depends on the errors you routinely encounter. Some processors may fail to perform a task (an expected but not desired outcome) and route the failed flowfile to REL_FAILURE, a specific relationship which can be connected to a processor to handle these failures, or back to the same processor to be retried. Others (or the same processors in different scenarios) may encounter exceptions, which are unexpected occurrences that cannot be resolved by the processor.
An example of this is PutKafka vs. EncryptContent. If the remote Kafka system is temporarily unavailable, the processor would fail to send the flowfile content; retrying after some delay period could succeed once the remote system is available again. By contrast, decrypting cipher text with the wrong key will always throw an exception, no matter how many times it is attempted or how long the retry delay is.
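To make the relationship model concrete, here is a minimal custom-processor sketch (the class name and the work done in the try block are illustrative) showing the standard success/failure routing pattern:

```java
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

// Illustrative: the standard success/failure relationship pattern.
public class ExampleTransformProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success").description("Flowfiles that were processed").build();
    static final Relationship REL_FAILURE = new Relationship.Builder()
            .name("failure").description("Flowfiles that could not be processed").build();

    @Override
    public Set<Relationship> getRelationships() {
        return Set.of(REL_SUCCESS, REL_FAILURE);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        try {
            // ... do the actual work here (transform content, call a service, etc.) ...
            session.transfer(flowFile, REL_SUCCESS);
        } catch (Exception e) {
            // Expected-but-undesired outcome: route to failure so the flow
            // designer can decide whether to retry, alert, or archive it.
            getLogger().error("Processing failed", e);
            session.transfer(flowFile, REL_FAILURE);
        }
    }
}
```

Connecting the failure relationship back to the processor itself gives the retry loop described above; connecting it onward gives the handle-and-report path.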
Many users route the errors to a PutEmail processor and report them to a specific user/group who can evaluate the errors and monitor the data flow if necessary. You can also use "Reporting Tasks" to monitor metrics or ingest provenance data as operational data and route that to email/offline storage, etc., to run analytics on it.

What is the purpose of data provenance in Apache NiFi Processors

For every processor there is a way to configure it, and there is a context menu entry to view data provenance.
Is there a good explanation of what data provenance is?
Data provenance is all about understanding the origin and attribution of data. In a typical system you get 'logs'. When you consider data flowing through a series of processes and queues, you end up with a lot of logs, of course. If you want to follow the path a given piece of data took, or how long it took to take that path, or what happened to an object that got split up into different objects, and so on, all of that is really time-consuming and tough. The provenance that NiFi supports is like logging on steroids: it is all about keeping and tracking the relationships between data and the events that shaped and impacted what happened to it. NiFi keeps track of where each piece of data comes from, what it learned about the data, maintains the trail across splits, joins, and transformations, records where it sends the data, and ultimately when it drops it. Think of it like a chain of custody for data.
This is really valuable for a few reasons. First, understanding and debugging: having this provenance capture means that from a given event you can go forwards or backwards in the flow to see where data came from and went. Given that NiFi also has an immutable, versioned content store under the covers, you can use this to click directly to the content at each stage of the flow, and you can replay the content and context of a given event against the latest flow. This in turn means much faster iteration toward the configuration and results you want. The provenance model is also valuable for compliance reasons: you can prove whether you sent data to the correct systems or not. If you learn that you didn't, you then have data with which you can address the issue, or a powerful audit trail for follow-up.
The provenance model in Apache NiFi is really powerful, and it is being extended to Apache MiNiFi, which is a subproject of Apache NiFi. More systems producing more provenance means a far stronger ability to track data end-to-end. Of course, this becomes even more powerful when it can be combined with other lineage systems or centralized lineage stores; Apache Atlas may be a great system to integrate with to provide such a centralized view. NiFi is able not only to do what I described above but also to send these events to such a central store. So, exciting times ahead for this.
Hope that helps.
