NiFi - Use Wait/Notify for triggering GetFTP

I'm writing two dataflows: one is a web service built with HandleHttpRequest/HandleHttpResponse processors which, after receiving a notification, should trigger a separate flow that uses GetFTP to pull files from an FTP directory.
I've tried to synchronize the two using the Wait/Notify processors, but GetFTP doesn't allow incoming connections, so I cannot connect a Wait processor to it.
Any idea how I can do this?

FetchFTP can be used in this case, as it is designed to work in conjunction with ListFTP.
This is a common pattern in Apache NiFi -- there will be a GetX processor, and then there will be ListX and FetchX processors which are used in tandem. ListX scans the source directory/listing/etc., generates a flowfile for each matching result, and sends those to FetchX to retrieve each item individually.
If you already know the relevant values (i.e. file names), you can provide those to the FetchFTP processor. If not, you'll be in the same position you are in now, because ListFTP is also a source processor and thus does not accept incoming connections. You could technically use the Wait/Notify processors to trigger a REST API invocation to start/stop the GetFTP processor (see the Apache NiFi REST API -- PUT /processors/{id}), but this is admittedly hacky.
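For illustration, here is a rough sketch of that start/stop call using Java's built-in HTTP client. Everything instance-specific (host, port, processor ID, lack of authentication) is a placeholder, and newer NiFi 1.x releases expose the state change as PUT /processors/{id}/run-status; verify against the REST API docs for your version.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Sketch only: start a processor through the NiFi REST API (NiFi 1.x).
    // Host, port, and processor ID are placeholders; a secured instance
    // would also need authentication headers.
    public class StartProcessor {
        public static void main(String[] args) throws Exception {
            String base = "http://localhost:8080/nifi-api";  // placeholder host/port
            String processorId = "your-getftp-processor-id"; // placeholder ID
            HttpClient client = HttpClient.newHttpClient();

            // Fetch the current revision; the PUT must echo it back.
            HttpRequest get = HttpRequest.newBuilder()
                    .uri(URI.create(base + "/processors/" + processorId))
                    .GET().build();
            String entity = client.send(get, HttpResponse.BodyHandlers.ofString()).body();
            // Crude extraction of the revision version; use a JSON library in practice.
            String version = entity.replaceAll("(?s).*?\"version\":(\\d+).*", "$1");

            // Flip the run state; send "STOPPED" instead to stop it again.
            String body = "{\"revision\":{\"version\":" + version + "},\"state\":\"RUNNING\"}";
            HttpRequest put = HttpRequest.newBuilder()
                    .uri(URI.create(base + "/processors/" + processorId + "/run-status"))
                    .header("Content-Type", "application/json")
                    .method("PUT", HttpRequest.BodyPublishers.ofString(body))
                    .build();
            System.out.println(client.send(put, HttpResponse.BodyHandlers.ofString()).statusCode());
        }
    }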

Related

Find Provenance Data For Flowfile Within a Processor

I am attempting to develop a NiFi processor that would extend the functionality of the built-in processor "Monitor Activity".
The problem I am attempting to solve is that in my application, I have multiple flows entering the processor, with the processor alerting by email when no flowfiles arrive within a certain time period. However, if only one of the flows stops, no alert will be triggered.
I would like to modify the processor such that it would be able to distinguish between the different flows and alert accordingly.
In order to do this, I would need a way to differentiate between flowfiles originating from one processor and another.
I am aware NiFi keeps detailed provenance records that can easily be accessed from within the GUI, but I'm unable to find an easy way of accessing this information programmatically from within processor code.
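(One workaround, if provenance proves hard to reach from processor code: have each upstream flow stamp an identifying attribute -- e.g. with UpdateAttribute -- and track last-arrival times per flow inside the processor. A minimal sketch of that bookkeeping; the attribute key and the five-minute threshold are made up:)

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch only: per-flow activity bookkeeping, keyed on an attribute that
    // each upstream flow would set. A processor could call recordArrival()
    // from onTrigger and check inactiveFlows() on a schedule.
    public class PerFlowActivity {
        private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();
        private final long thresholdMillis = 5 * 60 * 1000L; // hypothetical threshold

        public void recordArrival(String flowId) {
            lastSeen.put(flowId, System.currentTimeMillis());
        }

        public List<String> inactiveFlows() {
            long now = System.currentTimeMillis();
            List<String> quiet = new ArrayList<>();
            for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
                if (now - e.getValue() > thresholdMillis) {
                    quiet.add(e.getKey());
                }
            }
            return quiet;
        }
    }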

NiFi - How to get the current processor name and process group name from a custom processor (Java)

I'm creating a NiFi custom processor using Java.
One of the requirements is to get the previous processor name and process group name (like a breadcrumb) using Java code.
The previous processor name and process group name are not immediately (nor meant to be) available to processors -- can you explain more about your use case? You can perhaps use a SiteToSiteProvenanceReportingTask to send provenance information back to your own NiFi instance (to an Input Port, for example) and find the events that correspond to FlowFiles entering your custom processor; the events should have the source (previous) processor and destination (your custom) processor.
If instead you code your custom processor using InvokeScriptedProcessor with Groovy, for example, then you can "bend the rules" and get at the previous processor name and such: Groovy allows access to private members, and you can assume the implementation of the ProcessContext in onTrigger is an instance of StandardProcessContext, so you can get at its members, which include the upstream connections and thus the previous processor. For a particular FlowFile, though, I'm not sure you can use this approach to know which upstream processor it came from.
Alternatively, you could add an UpdateAttribute after each "previous processor" to set attribute(s) with the information about that processor, but that has to be hardcoded and applied to every corresponding part of the flow.
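To illustrate the UpdateAttribute route, here is a minimal sketch of a custom processor that simply reads the attribute back. The attribute name "previous.processor.name" is an arbitrary choice you would have to configure in each UpdateAttribute:

    import java.util.Collections;
    import java.util.Set;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;

    // Sketch only: reads an attribute stamped upstream by UpdateAttribute.
    public class ReadPreviousProcessorName extends AbstractProcessor {

        static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success").description("All FlowFiles").build();

        @Override
        public Set<Relationship> getRelationships() {
            return Collections.singleton(REL_SUCCESS);
        }

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
            FlowFile flowFile = session.get();
            if (flowFile == null) {
                return;
            }
            // The upstream UpdateAttribute must have set this (made-up) attribute.
            String previous = flowFile.getAttribute("previous.processor.name");
            getLogger().info("FlowFile arrived from: " + previous);
            session.transfer(flowFile, REL_SUCCESS);
        }
    }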
I faced this some time back. I used the InvokeHTTP processor with the nifi-api/process-groups/${process_group_id} web service.
This is how I implemented it:
Identify the process group where the error handling should be done. [Action Group]
Create a new process group [Error Handling Group] next to the Action Group and add a relationship to transfer files to the Error Handling Group.
Use the InvokeHTTP processor and set the HTTP Method to GET.
Set the Remote URL to http://{nifi-instance}:{port}/nifi-api/process-groups/${action_group_process_group_id}
You will get a response in JSON, which you will have to customize according to your needs.
Please let me know if you need the XML file that I am using; I can share it. It works fine for me.
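For reference, the same lookup can be sketched outside the flow. The host, port, and group ID below are placeholders, and the crude string extraction stands in for a proper JSON parser:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Sketch only: GET a process group entity and pull out its name.
    public class FetchProcessGroup {
        public static void main(String[] args) throws Exception {
            String groupId = "your-process-group-id"; // placeholder ID
            HttpRequest req = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8080/nifi-api/process-groups/" + groupId))
                    .GET().build();
            String json = HttpClient.newHttpClient()
                    .send(req, HttpResponse.BodyHandlers.ofString()).body();
            // Crude first-match extraction of the group name; use a JSON
            // library in practice rather than a regex.
            String name = json.replaceAll("(?s).*?\"name\":\"(.*?)\".*", "$1");
            System.out.println("Process group name: " + name);
        }
    }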

NiFi processor to route flows based on a changeable list of regexes

I am trying to use NiFi to act as a router for syslog, based on a list of regexes matched against syslog.body (NB: as this is just a proof of concept, I can change any part if needed).
The thought process is that, via a separate system (for now, vi and a text file 😃), an admin can define a list of criteria (regex format for each seems sensible) which, if matched, would result in syslog messages being sent to a specific separate system (for example, all critical audit data matched by the regex list is sent to the audit system, and all other data goes to the standard log store).
I know that this can be done with route-by-content processors, but their properties are configured before the processor starts, and an admin would have to stop the processor every time they need to make an edit.
I would like to load the list of regexes in periodically (automatically) and have the processor properties be updated.
I don't mind if this is done all natively in NiFi (that is preferable, for elegance and to save an external app being written) or via a REST API call driven by a Python script or something (or can NiFi send REST calls to itself?!).
I appreciate that a processor property cannot be updated while running, so the processor would have to be stopped to be updated, but that's fine, as the queue will buffer for the brief period. Maybe a check to see if the file has changed could avoid outages for no reason, rather than updating periodically regardless; I can solve that problem later.
Thanks
Chris
I think the easiest solution would be to use ScanContent, a processor which takes a dictionary file on disk containing a list of search terms and monitors the file for changes, reloading it when that happens. The processor then applies the search terms to the content of incoming flowfiles and allows you to route them based on matches. While this processor doesn't support regular expressions as dictionary terms, you could make a slight modification to the code, or use it as a baseline for a custom processor with those changes -- see the sketch below.
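As a starting point, here is a rough sketch of such a custom processor: it re-reads a pattern file whenever the file's modification time changes and routes each flowfile on whether its content matches any pattern. The property name, relationship names, and file format (one regex per line) are all assumptions, and error handling is minimal:

    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.concurrent.atomic.AtomicReference;
    import java.util.regex.Pattern;

    import org.apache.nifi.components.PropertyDescriptor;
    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;
    import org.apache.nifi.processor.util.StandardValidators;

    // Sketch only: reload a regex list from disk when the file changes and
    // route FlowFiles whose content matches any pattern.
    public class RouteOnReloadableRegexes extends AbstractProcessor {

        static final PropertyDescriptor REGEX_FILE = new PropertyDescriptor.Builder()
                .name("Regex File")
                .description("File containing one regular expression per line")
                .required(true)
                .addValidator(StandardValidators.FILE_EXISTS_VALIDATOR)
                .build();

        static final Relationship REL_MATCHED = new Relationship.Builder()
                .name("matched").description("Content matched a pattern").build();
        static final Relationship REL_UNMATCHED = new Relationship.Builder()
                .name("unmatched").description("No pattern matched").build();

        private final AtomicReference<List<Pattern>> patterns = new AtomicReference<>(new ArrayList<>());
        private volatile long lastModified = -1;

        @Override
        protected List<PropertyDescriptor> getSupportedPropertyDescriptors() {
            return List.of(REGEX_FILE);
        }

        @Override
        public Set<Relationship> getRelationships() {
            return new LinkedHashSet<>(List.of(REL_MATCHED, REL_UNMATCHED));
        }

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
            FlowFile flowFile = session.get();
            if (flowFile == null) {
                return;
            }
            try {
                // Reload the pattern list only when the file's mtime changes,
                // avoiding any stop/start of the processor.
                Path file = Paths.get(context.getProperty(REGEX_FILE).getValue());
                long modified = Files.getLastModifiedTime(file).toMillis();
                if (modified != lastModified) {
                    List<Pattern> fresh = new ArrayList<>();
                    for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                        if (!line.isBlank()) {
                            fresh.add(Pattern.compile(line.trim()));
                        }
                    }
                    patterns.set(fresh);
                    lastModified = modified;
                }

                // Read the content and test it against every pattern.
                StringBuilder content = new StringBuilder();
                session.read(flowFile, (InputStream in) ->
                        content.append(new String(in.readAllBytes(), StandardCharsets.UTF_8)));
                boolean matched = patterns.get().stream()
                        .anyMatch(p -> p.matcher(content).find());
                session.transfer(flowFile, matched ? REL_MATCHED : REL_UNMATCHED);
            } catch (Exception e) {
                getLogger().error("Routing failed: " + e.getMessage(), e);
                session.rollback();
            }
        }
    }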
If that doesn't work for you, there are a number of LookupService implementations which show how CSV, XML, property files, etc. can be monitored and read by the controller framework to provide an updated mapping of key/value pairs. These can also serve as a foundation for building a more complicated scan/match flow using the loaded terms/patterns.
Finally, if you have to rely on direct processor property updating, you can script this with the NiFi API calls to stop, update, and restart the processors so it can be done in near-real-time. To determine these APIs, visit the API documentation or execute the desired tasks via the UI in your browser and use the Developer Tools to capture the HTTP requests being made.
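If you do go the API route, the sequence is stop, update the property, then start, with each call echoing back the processor's current revision. A sketch against the NiFi 1.x endpoints -- the host, processor ID, property name, and lack of authentication are all placeholders to adapt, and you should confirm the endpoint shapes against your version's REST docs:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Sketch only: stop -> update a property -> start via the NiFi REST API.
    public class UpdateProcessorProperty {
        static final HttpClient CLIENT = HttpClient.newHttpClient();
        static final String BASE = "http://localhost:8080/nifi-api"; // placeholder

        static String send(String method, String path, String body) throws Exception {
            HttpRequest.Builder b = HttpRequest.newBuilder().uri(URI.create(BASE + path));
            if (body == null) {
                b.GET();
            } else {
                b.header("Content-Type", "application/json")
                 .method(method, HttpRequest.BodyPublishers.ofString(body));
            }
            return CLIENT.send(b.build(), HttpResponse.BodyHandlers.ofString()).body();
        }

        static String version(String entityJson) {
            // Crude revision extraction; a JSON library is the sane choice here.
            return entityJson.replaceAll("(?s).*?\"version\":(\\d+).*", "$1");
        }

        public static void main(String[] args) throws Exception {
            String id = "your-processor-id"; // placeholder ID

            // 1. Stop the processor (the revision must match the current one).
            String v = version(send("GET", "/processors/" + id, null));
            send("PUT", "/processors/" + id + "/run-status",
                    "{\"revision\":{\"version\":" + v + "},\"state\":\"STOPPED\"}");

            // 2. Update a property ("Regular Expression" is a made-up example).
            v = version(send("GET", "/processors/" + id, null));
            send("PUT", "/processors/" + id,
                    "{\"revision\":{\"version\":" + v + "},"
                    + "\"component\":{\"id\":\"" + id + "\","
                    + "\"config\":{\"properties\":{\"Regular Expression\":\"^CRITICAL.*\"}}}}");

            // 3. Start it again.
            v = version(send("GET", "/processors/" + id, null));
            send("PUT", "/processors/" + id + "/run-status",
                    "{\"revision\":{\"version\":" + v + "},\"state\":\"RUNNING\"}");
        }
    }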

How can NiFi handle burst data?

If the data submitted to NiFi does not come in a steady flow (but in bursts), how can NiFi handle it? Does it use a message broker to buffer the data? I haven't seen anything like this in its documentation.
NiFi connections (the links between processors) have the capability of buffering FlowFiles (the unit of data that NiFi handles, basically content + metadata about that content), and NiFi also has the feature of backpressure, the ability of a processor to "tell" the upstream flow that it cannot handle any more data at a particular time. The relevant discussion in the User Guide is here.
Basically, you can set up connection(s) to be as "wide" as you expect the burst to be, or, if that is not prudent, you can set them to a more appropriate value, and NiFi will take a sort of "leaky bucket with notification" approach: it will handle what data it can, and the framework will schedule upstream processors based on whether they would be able to do their job.
If you are getting data from a source system that does not buffer data, then you can suffer data loss when backpressure is applied; however, that is because the source system must push data when NiFi cannot prudently accept it, and thus the alterations should be made on the source system rather than in the NiFi flow.

Method for collecting user input within NiFi topology

I'm new to NiFi, so I'm not sure if this is possible (or a correct design approach):
I'm trying to create a processing pipeline where NiFi fetches a file (potentially multiple gigabytes in size) and performs some processing -- pretty straightforward until ...
A small portion of the resultant data needs to be shown to an end user. The user would provide an input, and then NiFi would need to perform follow-on processing of the original data set.
I was going to use a Wait processor, prompt the user (via a GUI) using a PostHTTP processor, and then use a Notify processor. The flow looks like this:
FetchFile ==> Initial Processing ==> Wait
Prompt User for input (via PostHTTP processor)
Receive User Input (via ListenHTTP processor)
Notify ==> Follow-on Processing
Unfortunately, the PostHTTP processor posts ALL of the FlowFile's content to the specified endpoint ... which, again, could be multiple gigabytes in size ... which is prohibitive.
Is there a "standalone" NiFi processor (which doesn't include the FlowFile content) for this sort of User interaction? Is there another implementation strategy for this use-case? Or is this not even a correct application of NiFi?
Thank you!
A possible approach would be to write the file out to a temporary store (local disk, HDFS, S3, etc.), write a request for user action, and essentially exit NiFi completely. The flow would then be restarted after the user has taken action. To trigger the restart, you could have the UI post to NiFi, write to a message queue, or write another file to a directory NiFi is polling.
I think this would be more flexible for managing the user interaction, since you wouldn't have to complete the whole UI response in an HTTP post cycle.
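To make the restart trigger concrete: one of the suggested options is for the UI to write a small file into a directory NiFi is polling (e.g. with ListFile/GetFile). A minimal sketch of an endpoint the UI could call, with the port, route, and trigger directory all arbitrary choices:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import com.sun.net.httpserver.HttpServer;

    // Sketch only: an HTTP endpoint the user-facing UI could call once the
    // user has acted; it writes a small trigger file into a directory that a
    // NiFi ListFile/GetFile processor polls to resume the flow.
    public class ResumeTrigger {
        public static void main(String[] args) throws IOException {
            HttpServer server = HttpServer.create(new InetSocketAddress(9090), 0);
            server.createContext("/resume", exchange -> {
                // The body carries the user's input plus a pointer to the staged file.
                byte[] body = exchange.getRequestBody().readAllBytes();
                Files.write(Paths.get("/data/nifi-triggers/resume-" + System.nanoTime() + ".json"), body);
                byte[] ok = "triggered".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, ok.length);
                exchange.getResponseBody().write(ok);
                exchange.close();
            });
            server.start();
        }
    }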
