Suppose you have an ExecuteScript processor in a NiFi flow.
This processor has two incoming queues.
Is there a way to choose which queue session.get() will pull a flow file from?
Thanks.
There's no direct way via the API to identify which queue a flow file is coming from. However, you can try this:
Add an UpdateAttribute processor to each upstream branch before ExecuteScript. For each branch, add the same attribute with a different value, say "queue.name" = "A" for one and "queue.name" = "B" for the other.
In ExecuteScript you can pass a FlowFileFilter to session.get() to fetch flow file(s) whose queue.name attribute is "A" or "B". Note that you may get an empty list; if you need at least one flow file to continue, you can just return when the list is empty.
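For example, in a Groovy-based ExecuteScript, a minimal sketch (assuming the queue.name attribute set by the UpdateAttribute processors above) would look like this:

import org.apache.nifi.processor.FlowFileFilter
import static org.apache.nifi.processor.FlowFileFilter.FlowFileFilterResult.*

// Accept only flow files tagged upstream with queue.name = "A";
// leave everything else in the queue for a later trigger
def flowFiles = session.get({ ff ->
    ff.getAttribute('queue.name') == 'A' ? ACCEPT_AND_CONTINUE : REJECT_AND_CONTINUE
} as FlowFileFilter)

// The filter may match nothing; return and wait for the next trigger
if (!flowFiles) return

flowFiles.each { ff ->
    // ... process the flow file here ...
    session.transfer(ff, REL_SUCCESS)
}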
I'm creating a NiFi custom processor using Java.
One of the requirements is to get the previous processor name and process group name (like a breadcrumb) using Java code.
The previous processor name and process group name are not immediately (nor meant to be) available to processors; can you explain more about your use case? You can perhaps use a SiteToSiteProvenanceReportingTask to send provenance information back to your own NiFi instance (to an Input Port, for example) and find the events that correspond to flow files entering your custom processor; those events should have the source (previous) processor and the destination (your custom) processor.
If instead you code your custom processor with InvokeScriptedProcessor using Groovy, for example, then you can "bend the rules" and get at the previous processor name and such. Groovy allows access to private members, and you can assume the implementation of the ProcessContext passed to onTrigger is an instance of StandardProcessContext, so you can get at its members, which include the upstream connections and thus the previous processor. For a particular FlowFile, though, I'm not sure you can use this approach to know which upstream processor it came from.
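A very rough sketch of that rule-bending (the procNode field name and the methods reached through it are private implementation details of a particular NiFi version, so treat all of them as assumptions):

// Inside the script's onTrigger(context, session) for InvokeScriptedProcessor.
// Groovy can reach private members; procNode is assumed to be the private
// ProcessorNode field of StandardProcessContext and may change between versions.
def procNode = context.procNode
def upstream = procNode.incomingConnections.collect { conn ->
    def source = conn.source   // the upstream (previous) component
    [name: source.name, group: source.processGroup?.name]
}
// upstream is now a list of maps like [name: ..., group: ...], one per incoming connection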
Alternatively, you could add an UpdateAttribute after each "previous processor" to set attribute(s) with the information about that processor, but that has to be hardcoded and applied to every corresponding part of the flow.
I faced this some time back. I used the InvokeHTTP processor to call the nifi-api/process-groups/${process_group_id} web service.
This is how I implemented it:
Identify the process group where the error handling should be done. [Action Group]
Create a new process group [Error Handling Group] next to the Action Group and add relationship to transfer files to Error Handling Group.
Use the InvokeHTTP processor and set HTTP Method to GET
Set Remote URL to http://{nifi-instance}:{port}/nifi-api/process-groups/${action_group_process_group_id}
You will get a JSON response, which you will have to customize according to your needs.
Please let me know if you need the XML file that I am using; I can share it. It works fine for me.
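For reference, a minimal Groovy sketch of the same GET call outside NiFi (the host, port, and use of the root alias are placeholders, and an unsecured instance is assumed):

import groovy.json.JsonSlurper

// Placeholder host/port; "root" is the alias NiFi accepts for the root process group id
def url = 'http://localhost:8080/nifi-api/process-groups/root'

def entity = new JsonSlurper().parse(new URL(url))
// The ProcessGroupEntity wraps the group's details under "component"
println entity.component.name
println entity.component.id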
I was wondering why some processors clone the flow file before routing it to the next processor.
For example, the QueryDatabaseTable processor always clones the flow file before passing it to the ConvertAvroToORC processor.
Why is there a need to clone the flow file?
I think the clone event represents the fact that the same flow file is being transferred to two different destinations (LogMessage and ConvertAvroToORC). So there is one flow file created by QueryDatabaseTable; then, when the framework sees that the success relationship goes to two places, it has to clone the flow file.
I have a NiFi template of 30 processors, with multiple conditional branches in the template. Now I want to add something at the end of the template so that I can get the list of all processor names that executed for a particular run.
How can I do this?
Thanks.
You could technically insert an UpdateAttribute processor after every "operational" processor to add an attribute containing the most recent processor, but @Bryan is correct that the provenance feature exists to provide this information automatically. If you need to operate on it, you can use the SiteToSiteProvenanceReportingTask to send that data to a Remote Process Group (linked to an Input Port on the same instance) and then treat that data like any other in NiFi and examine/transform it.
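Once those provenance records come back as JSON, pulling out the processor names is straightforward. Here is a minimal ExecuteScript (Groovy) sketch, assuming the componentName field used in the reporting task's JSON records:

import groovy.json.JsonSlurper
import org.apache.nifi.processor.io.InputStreamCallback

def flowFile = session.get()
if (!flowFile) return

def names = []
session.read(flowFile, { inputStream ->
    // The reporting task sends a JSON array of provenance events
    def events = new JsonSlurper().parse(inputStream)
    names = events.collect { it.componentName }.unique()
} as InputStreamCallback)

flowFile = session.putAttribute(flowFile, 'executed.processors', names.join(','))
session.transfer(flowFile, REL_SUCCESS)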
I need to use the PriorityAttributePrioritizer in NiFi.
I have looked at the prioritizers in the reference below:
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#settings
If I receive 10 flow files, I need to set a unique priority value on every flow file.
After that, the queue must be configured with the PriorityAttributePrioritizer.
Then the flow files are processed based on their priority values.
How can I set a priority value for separate flow files, and which prioritizer in NiFi will work for my case?
The PriorityAttributePrioritizer prioritizes flow files by looking for a flow file attribute named "priority" and sorting the flow files lexicographically based on the value of the priority.
You can set the priority attribute using an UpdateAttribute processor. For example, if you had three logical data feeds, and feed #1 was most important, feed #2 was second most important, and feed #3 was third, then you could use three UpdateAttribute processors to set the priority attribute to 1, 2, and 3, then use a funnel to converge them all.
You would set the PriorityAttributePrioritizer on the queue between the funnel and the next processor, and at this point any time a flow file with priority=1 hits the queue, it will always be processed before any flow files with priority=2 and priority=3.
Determining how to set the priority really depends on your data. It is usually based on something about the data, like a field extracted from each flow file to an attribute that tells you the priority, or just knowing that everything that comes from source #1 is higher priority than what comes from source #2. Setting randomly unique priorities doesn't really make sense, because then you don't even know what you are prioritizing on.
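If you'd rather script it than use UpdateAttribute, here is a minimal ExecuteScript (Groovy) sketch of the same idea (the source attribute and the feed-to-priority mapping are hypothetical):

def flowFile = session.get()
if (!flowFile) return

// Hypothetical mapping from a "source" attribute to a priority value;
// zero-padding keeps the lexicographic comparison consistent with numeric order
def priorities = ['feed1': '001', 'feed2': '002', 'feed3': '003']
def priority = priorities[flowFile.getAttribute('source')] ?: '999'

flowFile = session.putAttribute(flowFile, 'priority', priority)
session.transfer(flowFile, REL_SUCCESS)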
If the files are named after the time they were generated (e.g. file_2017-03-03T010101.csv), have you considered using UpdateAttribute to parse the filename into a date, and that date into epoch time (which happens to be an increasing number), as a first-level index/prioritizer?
This way you could have:
GetFile (single thread) -- Connector with FIFO --> UpdateAttribute (adding Epoch from filename date) -- Connector with PriorityAttributePrioritizer --> rest of your flow
Assuming the file name is file_2017-03-03T010101.csv, the expression language would be something like:
${filename:toDate("'file_'yyyy-MM-dd'T'HHmmss'.csv'", "UTC"):toNumber()}
The PriorityAttributePrioritizer prioritizes flow files by looking for a flow file attribute named "priority". My file names had a date appended, so I added an ExecuteScript processor calling a Groovy script to extract the date from the file name. The dates are then sorted and the flow files iterated over; based on the date ordering, the priority is incremented and added as the flow file attribute 'priority'.
Example:
FileOne: priority 1
FileTwo: priority 2
NiFi flow:
GetFile -> ExecuteScript (Groovy: sort files, add priority attribute) -> set the queue's prioritizer to PriorityAttributePrioritizer.
The above configuration will process the priority 1 file first, and the remaining files will then be processed in order.
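A minimal Groovy sketch of that ExecuteScript step (the filename pattern follows the file_2017-03-03T010101.csv example above; the batch size and zero-padding width are arbitrary choices):

import java.text.SimpleDateFormat

// Pull a batch of queued flow files so they can be ranked against each other
def flowFiles = session.get(100)
if (flowFiles.isEmpty()) return

def sdf = new SimpleDateFormat("'file_'yyyy-MM-dd'T'HHmmss'.csv'")

// Sort by the date parsed from the filename, then assign an increasing, zero-padded
// priority so the lexicographic comparison matches the date order
flowFiles.sort { sdf.parse(it.getAttribute('filename')) }
         .eachWithIndex { ff, i ->
             ff = session.putAttribute(ff, 'priority', String.format('%03d', i + 1))
             session.transfer(ff, REL_SUCCESS)
         }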
I have a number of GenerateTableFetch processors that send flow files to a downstream UpdateAttribute processor. From UpdateAttribute, the flow file is passed to an ExecuteSQL processor.
Is there any way to add an attribute to a flow file coming off a queue with the position of that flow file in the queue? For example, after I reset/clear the state for a GenerateTableFetch, I would like to know if this is the first batch of flow files coming from GenerateTableFetch. I can see the position of the flow file in the queue, but it would be nice if there were a way to add that as an attribute that is passed downstream. Is this possible?
This is not an available feature in Apache NiFi. The position of a flowfile in a queue is dynamic, and will change as flowfiles are removed from the queue, either by downstream processing or by flowfile expiration.
If you are simply trying to determine if the queue was empty before a specific flowfile was added, your best solution at this time is probably to use an ExecuteScript processor to get the desired connection via the REST API, then use FlowFileQueue#isActiveQueueEmpty() to determine if the specified queue is currently empty, and add a boolean attribute to the flowfile indicating it is the "first of a batch" or whatever logic you want to apply.
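A rough sketch of the REST side of that idea (the connection UUID, host, and port are placeholders; the response layout assumes a NiFi 1.x ConnectionStatusEntity and an unsecured instance):

import groovy.json.JsonSlurper

// Placeholder values; substitute the UUID of the queue (connection) you care about
def connectionId = 'your-connection-uuid'
def url = "http://localhost:8080/nifi-api/flow/connections/${connectionId}/status"

def entity = new JsonSlurper().parse(new URL(url))
// queuedCount lives under the aggregate status snapshot (assumed field layout)
def queued = entity.connectionStatus?.aggregateSnapshot?.queuedCount
println "Flow files currently queued: ${queued}"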
"Batches" aren't really a NiFi concept. Is there a specific action you want to take with the "first" flowfile? Perhaps there is other logic (i.e. the ExecuteSQL processor hasn't operated on a flowfile in x seconds, etc.) that could trigger your desired behavior.