I'm new to NiFi, so I'm not sure if this is possible (or a correct design approach):
I'm trying to create a processing pipeline where NiFi fetches a file (potentially multiple gigabytes in size) and performs some processing -- pretty straightforward until ...
A small portion of the resultant data needs to be shown to an end-user. The user would provide an input, and then NiFi will need to perform follow-on processing of the original data set.
I was going to use a Wait processor, prompt the user (via a GUI) using a PostHTTP processor, and then use a Notify processor. The flow looks like this:
FetchFile ==> Initial Processing ==> Wait
Prompt User for input (via PostHTTP processor)
Receive User Input (via ListenHTTP processor)
Notify ==> Follow-on Processing
Unfortunately, the PostHTTP processor posts ALL of the FlowFile's content to the specified endpoint ... which, again, could be multiple gigabytes in size ... which is prohibitive.
Is there a "standalone" NiFi processor (which doesn't include the FlowFile content) for this sort of User interaction? Is there another implementation strategy for this use-case? Or is this not even a correct application of NiFi?
Thank you!
A possible approach would be to write the file out to a temporary store (local disk, HDFS, S3, etc.), write a request for user action, and essentially exit NiFi completely. The flow would then be restarted after the user has taken action. To trigger the restart, you could have the UI post to NiFi, write to a message queue, or write another file to a directory NiFi is polling.
I think this would be more flexible for managing the user interaction, since you wouldn't have to complete the whole UI response in an HTTP post cycle.
Related
I am attempting to develop a NiFi processor that would extend the functionality of the built-in processor "Monitor Activity".
The problem I am attempting to solve is that in my application, I would have multiple flows entering the processor, with the processor alerting by email when no flowfiles arrive within a certain time period. However, if only one of the flows stop, no alert will be triggered.
I would like to modify the processor such that it would be able to distinguish between the different flows and alert accordingly.
In order to do this, I would need a way to deferentiate between flowfiles originating from one processor and another.
I am aware NiFi keeps detailed provenance records that can be easily accessed from within the GUI interface but I'm unable to find an easy way of accessing this information programmatically from within processor code.
I am trying to use Nifi to act as a router for syslog based on a list of regexes matching the syslog.body (nb as this is just a proof of concept I can change any part if needed)
The thought process is that via a separate system (for now, vi and a text file 😃) an admin can define a list of criteria (regex format for each seems sensible) which, if matched, would result in syslog messages being sent to a specific separate system (for example, all critical audit data (matched by the regex list) is sent to the audit system and all other data goes to the standard log store
I know that this can be done on Route by content processors but the properties are configured before the processor starts and an admin would have to stop the processor every time they need to make an edit
I would like to load the list of regex in periodically (automatically) and have the processor properties be updated
I don’t mind if this is done all natively in Nifi (but that is preferable for elegance and to save an external app being written) or via a REST API call driven by a python script or something (or can Nifi send REST calls to itself?!)
I appreciate a processor property cannot be updated while running, so it would have to be stopped to be updated, but that’s fine as the queue will buffer for the brief period. Maybe a check to see if the file has changed could avoid outages for no reason rather than periodic update regardless, I can solve that problem later.
Thanks
Chris
I think the easiest solution would be to use ScanContent, a processor which specifies a dictionary file on disk which contains a list of search terms and monitors the file for changes, reloading in that event. The processor then applies the search terms to the content of incoming flowfiles and allows you to route them based on matches. While this processor doesn't support regular expressions as dictionary terms, you could make a slight modification to the code or use this as a baseline for a custom processor with those changes.
If that doesn't work for you, there are a number of LookupService implementations which show how CSV, XML, property files, etc. can be monitored and read by the controller framework to provide an updated mapping of key/value pairs. These can also serve as a foundation for building a more complicated scan/match flow using the loaded terms/patterns.
Finally, if you have to rely on direct processor property updating, you can script this with the NiFi API calls to stop, update, and restart the processors so it can be done in near-real-time. To determine these APIs, visit the API documentation or execute the desired tasks via the UI in your browser and use the Developer Tools to capture the HTTP requests being made.
I use rest api in my program,I made a processor group for convent a mongodb collection to json file:
I want to run the scheduling only one time,so I set the "Run schedule" to 10000 sec.Then I will stop the group when the data flow have ran one time,and I made a Notify processor and add a DistributedMapCacheService.But the DistributedMapCacheClientService of the Notify processor only comunicates with the DistributedMapCacheService in nifi itself,It never nofity my program.
I try to use my own socket server,but I only get a message "nifi" but no more message.
My question is:If I only want scheduling run once and stop it,how do I know when shall I stop it?Or is there some other way to achieve my purpose,like detect if the json file exists or use incremental data(If the scheduling run twice,the data will be repeated twice)?
As #daggett said you can do it in a synchronous way you can use HandleHttpRequest as trigger and HandleHttpResponse to manage the response.
For an asynchronous was you have several options for the notification like PutTCP, PostHTTP, GetHTTP, use FTP, file system, XMPP or whatever.
If the scheduling run twice the duplicated elements depends on the processors you use, some of them have state others no, but if you are facing problems with repeated elements you can use the DetectDuplicate processor.
If the submitted data to NiFi are not coming in a steady flow (but on bursty) how can NiFi handle them? Does it use a message broker to buffer them? I haven't seen anything like this in its documentation.
NiFi connections (the links between processors) have the capability of buffering FlowFiles (the unit of data that NiFi handles, basically content + metadata about that content), and NiFi also has the feature of backpressure, the ability of a processor to "tell" the upstream flow that it cannot handle any more data at a particular time. The relevant discussion from User Guide is here.
Basically you can set up connection(s) to be as "wide" as you expect the burst to be, or if that is not prudent, you can set it to a more appropriate value, and NiFi will do sort of a "leaky bucket with notification" approach, where it will handle what data it can, and the framework will handle scheduling of upstream processors based on whether they would be able to do their job.
If you are getting data from a source system that does not buffer data, then you can suffer data loss when backpressure is applied; however that is because the source system must push data when NiFi can not prudently accept it, and thus the alterations should be made on the source system vs the NiFi flow.
I'm writing two dataflows, one is a webservice with HandleHttpRequest/Response processors, that after receiving a notification should trigger a separate flow with GetFTP to get files from an FTP directory.
I've tried to sync both using Wait/Notify processors, but GetFTP doesn't allow incoming connections so I cannot connect a Wait proc to it.
Any idea about how I can to do this?
FetchFTP can be used in this case, as it is designed to be used in conjunction with ListFTP.
This is a common pattern in Apache NiFi -- there will be a GetX processor, and then there will be ListX and FetchX processors which are used in tandem. ListX scans the source directory/listing/etc. and generates a flowfile for each matching result, and sends them to FetchX to retrieve each item individually.
If you already know the relevant values (i.e. file names), you can provide those to the FetchFTP processor. If not, you'll be in the same position you are in now, because ListFTP is also a source processor and thus does not accept incoming connections. You could technically use the Wait/Notify processors to trigger a REST API invocation to start/stop the GetFTP processor (see Apache NiFi REST API -- PUT /processors/{id}), but this is admittedly hacky.