NiFi - Process Multiple Files as a Group

I'm very new to Apache NiFi, so it's possible that this is already covered, but most of the information I can find supports a slightly different use case.
I've got a bunch of files that are posted to an FTP server or similar location. They're all associated with each other by filename:
ID_Part1.zip
ID_Attachments.zip
ID_Part2.zip
ID_Customizations.txt
ID.done
There are a variable number of files per logical processing group: some are mandatory, some are optional, and some may be unexpected. We know they're all associated based on their ID prefix, and we'll know they've all been delivered once a .done file exists.
What's an appropriate way, in NiFi parlance, to ensure that none of the files belonging to any given ID are processed until the .done file exists, and that the processor that receives that group of files gets access to all of them?
Some of how the data splitting and segregating is done is still magical to me, but it would be a catastrophic failure for my requirements if some processor happened to, say, see all of those files except ID_Customizations.txt and process them as a valid, but secretly incomplete, group.

"What's an appropriate way, in NiFi parlance, to ensure that none of the files belonging to any given ID are processed until the .done file exists"
In your GetFile or ListFile processor, you can use the "File Filter" property to pick up only the .done file first.
"processor that receives that group of files gets access to all of them"
Once the .done flowfile arrives, you can use the FetchFile processor to fetch the files of the target directory. FetchFile reads one file per incoming flowfile, so each file to retrieve needs a flowfile that carries its path.
More info about these processors:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.ListFile/index.html
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.FetchFile/
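To make the gating rule concrete, here is a minimal sketch in plain Python (not NiFi code) of the logic the flow has to implement: group filenames by ID prefix and release a group only once its .done marker is present. The rule that the ID is everything before the first "_" or "." is an assumption based on the example filenames, and `group_id` and `complete_groups` are hypothetical helper names.

```python
import re
from collections import defaultdict


def group_id(filename):
    # Assumed naming rule: the ID is everything before the first "_" or ".".
    return re.split(r"[_.]", filename, maxsplit=1)[0]


def complete_groups(filenames):
    """Return {id: sorted member files} only for IDs whose .done marker exists."""
    members = defaultdict(list)
    done_ids = set()
    for name in filenames:
        gid = group_id(name)
        if name == gid + ".done":
            done_ids.add(gid)          # marker file: the group is fully delivered
        else:
            members[gid].append(name)  # payload file: hold until the marker shows up
    return {gid: sorted(members[gid]) for gid in done_ids}
```

Groups without a marker are simply absent from the result, so nothing downstream can pick up a partial set by accident.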

Related

NiFi How to get the current processor Name and Processor group name through the custom processor using (Java)

I'm creating a custom NiFi processor in Java.
One of the requirements is to get the previous processor's name and process group (like a breadcrumb) in Java code.
The previous processor name and process group name are not immediately available to processors (nor are they meant to be). Can you explain more about your use case? You can perhaps use a SiteToSiteProvenanceReportingTask to send provenance information back to your own NiFi instance (to an Input Port, for example) and find the events that correspond to FlowFiles entering your custom processor. Those events should have the source (previous) processor and destination (your custom) processor.
If instead you code your custom processor using InvokeScriptedProcessor with Groovy for example, then you can "bend the rules" and get at the previous processor name and such, as Groovy allows access to private members and you can assume the implementation of the ProcessContext in onTrigger is an instance of StandardProcessContext, so you can get at its members which include upstream connections and thus the previous processor. For a particular FlowFile though, I'm not sure you can use this approach to know which upstream processor it came from.
Alternatively, you could add an UpdateAttribute after each "previous processor" to set attribute(s) with the information about that processor, but that has to be hardcoded and applied to every corresponding part of the flow.
I faced this some time back. I used the InvokeHTTP processor with the nifi-api/process-groups/${process_group_id} web service.
This is how I implemented it:
1. Identify the process group where the error handling should be done. [Action Group]
2. Create a new process group [Error Handling Group] next to the Action Group and add a relationship to transfer files to the Error Handling Group.
3. Use the InvokeHTTP processor and set HTTP Method to GET.
4. Set Remote URL to http://{nifi-instance}:{port}/nifi-api/process-groups/${action_group_process_group_id}
5. You will get a JSON response, which you will have to customize according to your needs.
Please let me know if you need the XML file that I am using; I can share it. It works fine for me.
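For anyone post-processing that response outside NiFi, the steps above can be sketched roughly as follows in Python. The response shape (a top-level "component" object with a "name" field) is my assumption about the process group entity, and SAMPLE_BODY is a made-up, heavily trimmed body for illustration only; check a real response from your own instance before relying on it.

```python
import json


def process_group_url(base_url, group_id):
    """Build the Remote URL used by the InvokeHTTP GET in step 4."""
    return f"{base_url.rstrip('/')}/nifi-api/process-groups/{group_id}"


def group_name(response_body):
    """Extract the process group name from the JSON body, or None if it is absent."""
    entity = json.loads(response_body)
    return entity.get("component", {}).get("name")


# Hypothetical, trimmed response body (not captured from a real NiFi instance).
SAMPLE_BODY = '{"id": "abc-123", "component": {"id": "abc-123", "name": "Action Group"}}'
```

With that sample, `group_name(SAMPLE_BODY)` yields the group name that would become the breadcrumb.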

NiFi processor to route flows based on changeable list of regex

I am trying to use NiFi as a router for syslog, based on a list of regexes matched against syslog.body (NB: as this is just a proof of concept, I can change any part if needed).
The thought process is that, via a separate system (for now, vi and a text file 😃), an admin can define a list of criteria (a regex for each seems sensible) which, if matched, would result in syslog messages being sent to a specific separate system. For example, all critical audit data (matched by the regex list) is sent to the audit system and all other data goes to the standard log store.
I know that this can be done with the RouteOnContent processor, but its properties are configured before the processor starts, and an admin would have to stop the processor every time they need to make an edit.
I would like to load the list of regexes periodically (automatically) and have the processor properties updated.
I don’t mind whether this is done all natively in NiFi (preferable for elegance, and to save writing an external app) or via a REST API call driven by a Python script or something (or can NiFi send REST calls to itself?!).
I appreciate that a processor property cannot be updated while the processor is running, so it would have to be stopped to be updated, but that’s fine as the queue will buffer for the brief period. Maybe a check to see whether the file has changed could avoid needless outages, rather than updating periodically regardless; I can solve that problem later.
Thanks
Chris
I think the easiest solution would be to use ScanContent, a processor which specifies a dictionary file on disk which contains a list of search terms and monitors the file for changes, reloading in that event. The processor then applies the search terms to the content of incoming flowfiles and allows you to route them based on matches. While this processor doesn't support regular expressions as dictionary terms, you could make a slight modification to the code or use this as a baseline for a custom processor with those changes.
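To illustrate the reload-on-change idea with regexes rather than plain terms, here is a small standalone Python sketch. It is not ScanContent's code or a NiFi processor; the class name, the "audit"/"default" route names, and the convention of skipping blank lines and "#" comments are all assumptions made for the example.

```python
import os
import re


class ReloadingRouter:
    """Route messages using regexes read from a file, re-reading it when it changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None      # modification time of the last version loaded
        self._patterns = []

    def _reload_if_changed(self):
        mtime = os.stat(self.path).st_mtime_ns
        if mtime == self._mtime:
            return              # file untouched since the last load
        with open(self.path, encoding="utf-8") as handle:
            lines = [line.strip() for line in handle]
        # Assumed file convention: one regex per line, blanks and "#" comments ignored.
        self._patterns = [re.compile(l) for l in lines if l and not l.startswith("#")]
        self._mtime = mtime

    def route(self, body):
        self._reload_if_changed()
        if any(p.search(body) for p in self._patterns):
            return "audit"
        return "default"
```

Checking the modification time on every message keeps the admin's edit-and-save workflow, with no stop/start of anything.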
If that doesn't work for you, there are a number of LookupService implementations which show how CSV, XML, property files, etc. can be monitored and read by the controller framework to provide an updated mapping of key/value pairs. These can also serve as a foundation for building a more complicated scan/match flow using the loaded terms/patterns.
Finally, if you have to rely on direct processor property updating, you can script this with the NiFi API calls to stop, update, and restart the processors so it can be done in near-real-time. To determine these APIs, visit the API documentation or execute the desired tasks via the UI in your browser and use the Developer Tools to capture the HTTP requests being made.
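As a rough sketch of that stop, update, restart sequence, the Python helpers below only build the requests; they do not send anything. The paths and body shapes reflect my understanding of the processor endpoints (a run-status call and a processor update, each carrying a revision) and should be treated as assumptions to confirm against the API documentation or the browser capture for your NiFi version. The function names are hypothetical.

```python
def run_status_request(processor_id, version, state):
    """Request that asks NiFi to set a processor to STOPPED or RUNNING."""
    return (
        "PUT",
        f"/nifi-api/processors/{processor_id}/run-status",
        {"revision": {"version": version}, "state": state},
    )


def update_properties_request(processor_id, version, properties):
    """Request that replaces selected properties on a stopped processor."""
    return (
        "PUT",
        f"/nifi-api/processors/{processor_id}",
        {
            "revision": {"version": version},
            "component": {"id": processor_id, "config": {"properties": properties}},
        },
    )
```

A driver script would issue the stop request, then the update, then the start request, taking the revision version for each call from the previous response rather than guessing it.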

what is a fastest way to remove nifi flowfile content?

I have a workflow where I am getting JSON files as responses from a REST API. I am getting approximately 100k files in a session, and the total size of all the files is 15 GB. I have to save each file to the file system, which I am doing. At the end of the process, I have to wait for all the files to be present before I send a success message.
Once I save a file to the file system, I call Notify + Wait, but I don't need the 15 GB of data in the flowfiles anymore. So, to release some space, I thought of using either ReplaceText or ModifyBytes to clear the content so that Notify + Wait runs smoothly. The total wait for this process is 3 hours.
But the process takes too long in both cases (ReplaceText or ModifyBytes).
Can you suggest the fastest way to clear flowfile data? I do not need any attributes either. Is there a way I can abandon the old flowfile and generate a KB-sized flowfile midway?
What I want is something like GenerateFlowFile, but in the middle of the flow, so that for each of my existing flowfiles I can drop the old one and generate a blank flowfile for Notify and Wait.
Thanks
NiFi's Content Repository and FlowFile Repository are based on a copy-on-write mechanism, so if you don't change the contents or metadata, then you are not necessarily "keeping" the 15GB across those processors.
Having said that, if all you need is the existence of such flow files on disk (but not contents or metadata), try ExecuteScript with the following Groovy script:
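// Pull up to 1000 queued flow files in one batch.
def flowFiles = session.get(1000)
if (!flowFiles) return
flowFiles.each {
    // Emit a new, empty flow file in place of each original.
    session.transfer(session.create(), REL_SUCCESS)
}
// Drop the originals.
session.remove(flowFiles)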
This script will grab up to 1000 flow files at a time, and for each one, send an empty flow file downstream. It then removes all the original incoming flow files.
Note that this (i.e. your use case) will "break" the provenance/lineage chain, so if something goes wrong in your flow, you won't be able to tell which flow files came from which parent flow files, etc. This limitation is one reason why you don't see a full processor that performs this kind of function.
In case you need to keep the attributes, lineage and metadata you can use the following code (grabs only 1 flowfile at a time). The only thing that changes is the UUID, but otherwise everything is kept - except the content of course.
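def f = session.get()
if (f == null) return
// create(f) makes a child of f: attributes and lineage carry over, content starts empty.
session.transfer(session.create(f), REL_SUCCESS)
session.remove(f)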

NiFi: Get all the processors name involved in a particular run

I have a NiFi template of 30 processors, with multiple conditional branches. Now I want to add something at the end of the template so that I can get the list of all processor names that executed for a particular run.
How can I do this?
Thanks,
You could technically insert an UpdateAttribute processor after every "operational" processor which would add an attribute containing the most recent processor, but @Bryan is correct that the provenance feature exists to provide this information automatically. If you need to operate on it, you can use the SiteToSiteProvenanceReportingTask to send that data to a Remote Process Group (linked to an Input Port on the same instance) and then treat that data as any other in NiFi and examine/transform it.
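Once the provenance events are available as JSON, pulling out the processors for one flowfile is a small filtering job. The Python sketch below assumes each event is a dict with "entityId" (the flowfile UUID), "componentName", and "timestampMillis" fields; those field names are my assumption about the reporting task's output and should be checked against the events you actually receive.

```python
def processors_for_flowfile(events, flowfile_uuid):
    """List the component names that touched one flowfile, oldest first, no repeats."""
    relevant = sorted(
        (e for e in events if e.get("entityId") == flowfile_uuid),
        key=lambda e: e["timestampMillis"],
    )
    names = []
    for event in relevant:
        name = event.get("componentName")
        if name and name not in names:
            names.append(name)
    return names
```

Note that this follows a single UUID only; flowfiles produced by splits or clones have their own UUIDs, so a full "run" may need the lineage followed across parents and children.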

How to solve the relationship failure?

I have a processor that appears to be creating FlowFiles correctly (modified a standard processor), but when it goes to commit() the session, an exception is raised:
2016-10-11 12:23:45,700 ERROR [Timer-Driven Process Thread-6] c.s.c.processors.files.GetFileData [GetFileData[id=8f5e644d-591c-4df1-8c79-feea118bd8c0]] Failed to retrieve files due to {}  org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord transfer relationship not specified
I'm assuming this indicates there's no connection available to commit the transfer; however, a "success" relationship is registered during init() in the same way as in the original processor, and the outgoing success relationship is connected to another processor's input as it should be.
Any suggestions for troubleshooting?
What changes did you make to the standard processor? If you are calling methods on the ProcessSession object, ensure that you are saving the latest "version" of the FlowFile returned from those method calls, and transfer only the latest version to "success".
FlowFile references are immutable; often in code you will see an initial reference like "flowFile" pointing at the incoming flow file (from session.get() for example), then it gets updated as the flow file is mutated, such as flowFile = session.putAttribute(flowFile, "myAttribute", "myValue").
Also ensure that you have transferred or removed the latest version of each distinct flow file (not the various references to the same flow file) to some relationship (even Relationship.SELF if need be). If your processor creates a new flow file, ensure that new flow file is transferred. If the incoming flow file is no longer needed, be sure to call session.remove() on it.
There are some common patterns and additional guidance in the NiFi Developer's Guide, including test patterns; your unit test(s) for this processor should be able to flush out this error (by asserting how many flow files should have been transferred to which relationship(s) during the test).
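To show why a forgotten flow file produces this kind of failure at commit time, here is a toy model in Python. It is not NiFi's ProcessSession and mirrors none of its internals; ToySession and its methods are hypothetical and only capture the bookkeeping rule described above: every flow file the session hands out must be transferred or removed before commit.

```python
import itertools


class ToySession:
    """Toy stand-in for a session that tracks which records still lack a destination."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = set()   # record ids neither transferred nor removed yet

    def create(self):
        record_id = next(self._ids)
        self._pending.add(record_id)
        return record_id

    def transfer(self, record_id, relationship):
        self._pending.discard(record_id)

    def remove(self, record_id):
        self._pending.discard(record_id)

    def commit(self):
        if self._pending:
            raise RuntimeError(
                f"transfer relationship not specified for records {sorted(self._pending)}"
            )
```

Creating two records and transferring only one makes `commit()` raise; removing or transferring the second one lets it pass, which is the same check a unit test on the real processor would exercise.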
