How to specify priority attributes for individual flowfiles? - apache-nifi

I need to use the PriorityAttributePrioritizer in NiFi.
I have seen the available prioritizers in the reference below:
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#settings
If I receive 10 flowfiles, I need to set a unique priority value on each flowfile.
After that, the queue configuration must use the PriorityAttributePrioritizer,
so that the flowfiles are processed based on their priority values.
How can I set a priority value on individual flowfiles, and which prioritizer in NiFi works for my case?

The PriorityAttributePrioritizer prioritizes flow files by looking for a flow file attribute named "priority" and sorting the flow files lexicographically based on the value of the priority.
You can set the priority attribute using an UpdateAttribute processor. For example, if you had three logical data feeds, and feed #1 was most important, feed #2 was second most important, and feed #3 was third, then you could use three UpdateAttribute processors to set the priority attribute to 1, 2, and 3, then use a funnel to converge them all.
You would set the PriorityAttributePrioritizer on the queue between the funnel and the next processor; at that point, any time a flow file with priority=1 hits the queue, it will always be processed before any flow files with priority=2 or priority=3.
Determining how to set the priority really depends on your data. It is usually based on something about the data, such as a field extracted from each flow file into an attribute that indicates its priority, or simply knowing that everything that comes from source #1 is higher priority than what comes from source #2. Assigning random, unique priorities doesn't really make sense, because then you don't know what you are actually prioritizing on.
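As a rough sketch of the three-feed layout described above (the sources and exact processors are just illustrative):
GetFile (feed #1) --> UpdateAttribute (priority = 1) --\
GetFile (feed #2) --> UpdateAttribute (priority = 2) ----> Funnel -- queue with PriorityAttributePrioritizer --> next processor
GetFile (feed #3) --> UpdateAttribute (priority = 3) --/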

If the files are named after the time they were generated (e.g. file_2017-03-03T010101.csv), have you considered using UpdateAttribute to parse the filename into a date, and that date into epoch time (which happens to be an increasing number), as a first-level index/prioritizer?
This way you could have:
GetFile (single thread) -- Connector with FIFO --> UpdateAttribute (adding Epoch from filename date) -- Connector with PriorityAttributePrioritizer --> rest of your flow
Assuming the file name is file_2017-03-03T010101.csv, the expression language would be something like:
${filename:toDate("'file_'yyyy-MM-dd'T'HHmmss'.csv'", "UTC"):toNumber()}

The PriorityAttributePrioritizer prioritizes flow files by looking for a flow file attribute named "priority". My file names had a date appended, so I added an ExecuteScript processor calling a Groovy script that extracts the date from each file name. The dates are sorted, the flowfiles are iterated in that order, and an incrementing priority is added to each flowfile as the attribute 'priority'.
Example:
File one: priority 1
File two: priority 2
NiFi flow:
GetFile -> ExecuteScript (Groovy: sort files, add priority attribute) -> set the queue prioritizer to PriorityAttributePrioritizer.
With this configuration, the file with priority 1 is processed first, and the remaining files are processed in order.
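A minimal ExecuteScript (Groovy) sketch of that idea might look like the following. The filename pattern (file_2017-03-03T010101.csv) and the batch size of 100 are assumptions; adjust both to your data.

import java.text.SimpleDateFormat

// Pull a batch of queued flowfiles so they can be sorted together (batch size is an assumption).
def flowFiles = session.get(100)
if (!flowFiles) return

// Assumed filename pattern: file_2017-03-03T010101.csv
def fmt = new SimpleDateFormat("'file_'yyyy-MM-dd'T'HHmmss'.csv'")

// Sort by the date embedded in the filename, oldest first.
def sorted = flowFiles.sort { ff -> fmt.parse(ff.getAttribute('filename')) }

sorted.eachWithIndex { ff, i ->
    // Oldest file gets priority 001, the next 002, and so on.
    // Zero-padding keeps the ordering correct even under plain string comparison.
    ff = session.putAttribute(ff, 'priority', String.format('%03d', i + 1))
    session.transfer(ff, REL_SUCCESS)
}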

Related

Read flow file attribute/content to processor property

I want to set a property of a processor based on the contents of the last flowfile that came through.
Example: I instantiate the flowfile with a GenerateFlowFile processor whose Custom Text is set to ${now()}, so the content is the timestamp at the time the flowfile was created.
I want a processor (which kind is irrelevant to me) to read the content of the flowfile (the timestamp) into the processor's custom property property_name. Afterwards I want to be able to query the processor via the REST API and read that property from the processor.
Initially I thought I could do that with the ExtractText processor, but it extracts text based on regex and writes it back to the flowfile, while I want to save that information in the processor until the next flowfile arrives.
You can't do this directly in NiFi: while a processor is running, you can't update its configuration.
Maybe you can use state variables on UpdateAttribute?
Stateful Usage
By selecting the "store state locally" option for the "Store State" property, UpdateAttribute will not only store the evaluated properties as attributes of the FlowFile but also as stateful variables to be referenced in a recursive fashion. This enables the processor to calculate things like the sum or count of incoming FlowFiles. A dynamic property can be referenced as a stateful variable like so:
Dynamic Property - key: theCount, value: ${getStateValue("theCount"):plus(1)}
This example will keep a count of the total number of FlowFiles that have passed through the processor.
To use logic on top of State, simply use the "Advanced Usage" of UpdateAttribute. All Actions will be stored as stateful attributes as well as being added to FlowFiles. Using the "Advanced Usage" it is possible to keep track of things like a maximum value of the flow so far. This would be done by having a condition of "${getStateValue("maxValue"):lt(${value})}" and an action of attribute: "maxValue", value: "${value}".
The "Stateful Variables Initial Value" property is used to initialize the stateful variables and is required to be set if running statefully. Some logic rules will require a very high initial value, like using the Advanced rules to determine the minimum value.
If stateful properties reference other stateful properties then the value for the other stateful properties will be an iteration behind. For example, attempting to calculate the average of the incoming stream requires the sum and count. If all three properties are set in the same UpdateAttribute (like below) then the Average will always not include the most recent values of count and sum:
Count - key: theCount, value: ${getStateValue("theCount"):plus(1)}
Sum - key: theSum, value: ${getStateValue("theSum"):plus(${flowfileValue})}
Average - key: theAverage, value: ${getStateValue("theSum"):divide(getStateValue("theCount"))}
Instead, since the Average only relies on the theCount and theSum attributes (which are added to the FlowFile as well), there should be a following Stateless UpdateAttribute which properly calculates the average.
In the event that the processor is unable to get the state at the beginning of the onTrigger, the FlowFile will be pushed back to the originating relationship and the processor will yield. If the processor is able to get the state at the beginning of the onTrigger but unable to set the state after adding attributes to the FlowFile, the FlowFile will be transferred to "set state fail". This is normally due to the state not being the most up to date version (another thread has replaced the state with another version). In most use-cases this relationship should loop back to the processor since the only affected attributes will be overwritten.
Note: Currently the only "stateful" option is to store state locally. This is done because the current implementation of clustered state relies on Zookeeper and Zookeeper isn't designed for the type of load/throughput UpdateAttribute with state would demand. In the future, if/when multiple different clustered state options are added, UpdateAttribute will be updated.
Thanks to @Ivan I was able to create a fully working solution - for future reference:
Instantiate flowfiles with e.g. a GenerateFlowFile processor and add a custom property "myproperty" with the value ${now()} (note: you can add this property to the flowfiles in any processor; it doesn't have to be a GenerateFlowFile processor).
Have an UpdateAttribute processor with the Store State option (under the processor properties) set to "Store state locally".
Add a custom property in the UpdateAttribute processor with the name readable_property and set it to the value ${'myproperty'}.
The state of the processor now contains the value of the last flowfile (e.g. with a timestamp of when the attribute was added to the flowfile).
Added Bonus:
Get the value of the stateful processor (and hence the value of the last flowfile that passed through (!) ) via the REST-API and a GET on the URI /nifi-api/processors/{id}/state
The JSON which gets returned contains the following lines:
{
"key":"readable_property"
,"value":"Wed Apr 14 11:13:40 CEST 2021"
,"clusterNodeId":"some-id-0d8eb6052"
,"clusterNodeAddress":"some-host:port-number"
}
Then you just have to parse the JSON for the value.
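For example, a small Groovy sketch of that parsing; the base URL and processor id are placeholders, the componentState/localState field path is an assumption based on the shape of the JSON above, and a secured NiFi would also need an Authorization header:

import groovy.json.JsonSlurper

def processorId = 'your-processor-uuid'   // placeholder, not a real id
def url = new URL("http://localhost:8080/nifi-api/processors/${processorId}/state")

// Fetch and parse the state JSON returned by the REST API.
def state = new JsonSlurper().parseText(url.text)

// Assumed path: componentState.localState.state is a list of key/value entries
// like the snippet shown above.
def entry = state?.componentState?.localState?.state?.find { it.key == 'readable_property' }
println entry?.value   // e.g. "Wed Apr 14 11:13:40 CEST 2021"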
You should use the UpdateAttribute processor.
You can read about several approaches, e.g. "Update attributes based on content in NiFi".

NiFi: Get all the processor names involved in a particular run

I have a NiFi template with 30 processors, and there are multiple conditional branches in the template. Now I want to add something at the end of the template so that I can get the list of all processor names that executed for a particular run.
How can I do this?
Thanks,
You could technically insert an UpdateAttribute processor after every "operational" processor to add an attribute containing the most recent processor name, but @Bryan is correct that the provenance feature exists to provide this information automatically. If you need to operate on it, you can use the SiteToSiteProvenanceReportingTask to send that data to a Remote Process Group (linked to an Input Port on the same instance) and then treat that data as any other in NiFi and examine/transform it.
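As a rough sketch of that wiring (the port name is just illustrative):
SiteToSiteProvenanceReportingTask (configured at the controller level) --> Remote Process Group --> Input Port "provenance" on the same instance --> downstream processors that filter/transform the provenance events like any other flowfiles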

NiFi - Choose Queue to execute

Suppose you have an ExecuteScript processor in a NiFi flow.
This processor has 2 incoming queues.
Is there a way to choose from which Queue session.get() will pull the flowfile?
Thanks.
There's no direct way via the API to identify which queue a flow file is coming from. However you can try this:
Add an UpdateAttribute to each upstream flow before ExecuteScript. For each branch, add the same attribute with a different value, say "queue.name" = "A" for one and "queue.name" = "B" for the other
In ExecuteScript you can pass a FlowFileFilter to session.get(), to fetch flow file(s) whose queue.name attribute is "A" or "B". Note that you may get an empty list, and if you need at least one flow file to continue, you can just return if the list is empty.
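A minimal ExecuteScript (Groovy) sketch of that filtering, assuming the upstream UpdateAttribute processors set queue.name as described above:

import org.apache.nifi.processor.FlowFileFilter
import static org.apache.nifi.processor.FlowFileFilter.FlowFileFilterResult.*

// Only accept flowfiles whose queue.name attribute is "A"; leave the rest in the queue.
def flowFiles = session.get({ ff ->
    ff.getAttribute('queue.name') == 'A' ? ACCEPT_AND_CONTINUE : REJECT_AND_CONTINUE
} as FlowFileFilter)

// The filter may match nothing on this trigger; just return and wait for the next one.
if (!flowFiles || flowFiles.isEmpty()) return

flowFiles.each { ff ->
    // ... work with the flowfiles that came from queue "A" ...
    session.transfer(ff, REL_SUCCESS)
}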

Access to queue attributes?

I have a number of GenerateTableFetch processors that send flowfiles to a downstream UpdateAttribute processor. From the UpdateAttribute, the flowfile is passed to an ExecuteSQL processor.
Is there any way to add an attribute to a flowfile coming off a queue with the position of that flowfile in the queue? For example, after I reset/clear the state for a GenerateTableFetch, I would like to know if this is the first batch of flowfiles coming from GenerateTableFetch. I can see the position of the flowfile in the queue, but it would be nice if there were a way to add that as an attribute that is passed downstream. Is this possible?
This is not an available feature in Apache NiFi. The position of a flowfile in a queue is dynamic, and will change as flowfiles are removed from the queue, either by downstream processing or by flowfile expiration.
If you are simply trying to determine if the queue was empty before a specific flowfile was added, your best solution at this time is probably to use an ExecuteScript processor to get the desired connection via the REST API, then use FlowFileQueue#isActiveQueueEmpty() to determine if the specified queue is currently empty, and add a boolean attribute to the flowfile indicating it is the "first of a batch" or whatever logic you want to apply.
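A hedged ExecuteScript (Groovy) sketch of the REST-API part of that idea follows; the base URL, the connection id, and the JSON field path (status.aggregateSnapshot.flowFilesQueued) are assumptions that may differ between NiFi versions or require authentication:

import groovy.json.JsonSlurper

def flowFile = session.get()
if (!flowFile) return

def connectionId = 'your-connection-uuid'   // placeholder for the upstream connection's id
def url = new URL("http://localhost:8080/nifi-api/connections/${connectionId}")

// Ask the REST API how many flowfiles are currently queued on that connection.
def json = new JsonSlurper().parseText(url.text)
def queued = json?.status?.aggregateSnapshot?.flowFilesQueued ?: 0

// Tag the flowfile so downstream processors can branch on it.
flowFile = session.putAttribute(flowFile, 'queue.was.empty', String.valueOf(queued == 0))
session.transfer(flowFile, REL_SUCCESS)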
"Batches" aren't really a NiFi concept. Is there a specific action you want to take with the "first" flowfile? Perhaps there is other logic (i.e. the ExecuteSQL processor hasn't operated on a flowfile in x seconds, etc.) that could trigger your desired behavior.

Why do we use the TIBCO Mapper activity?

The tibco documentation says
The Mapper activity adds a new process variable to the process definition. This variable can be a simple datatype, a TIBCO ActiveEnterprise schema, an XML schema, or a complex structure.
So my question is: does the TIBCO Mapper only perform this simple function? We can also create process variables in the process definition (by right-clicking on the process definition). I looked for this on Google, but nobody clearly explains why to use this activity; I also tried YouTube, where there is only one video and it does not explain it clearly. I am looking for an example of how it is used in large organizations, ideally a real-world example. Thanks in advance.
The term "process variable" is a bit overloaded I guess:
The process variables that you define in the Process properties are stateful. You can use (read) their values anywhere in the process and you can change their values during the process using the Assign task (yellow diamond with a black equals sign).
The mapper activity produces a new output variable of that task that you can only use (read) in activities that are downstream from it. You cannot change its value after the mapper activity, as for any other activity's output.
The mapper activity is mainly useful to perform complex and reusable data mappings in it rather than in the mappers of other activities. For example, you have a process that has to map its input data into a different data structure and then has to both send this via a JMS message and log it to a file. The mapper allows you to perform the mapping only once rather than doing it twice (both in the Send JMS and Write to File activity).
You'll find that in real world projects, the mapper activity is quite often used to perform data mapping independently of other activities, it just gives a nicer structure to the processes. In contrast the Process Variables defined in the Process properties together with the Assign task are used much less frequently.
Here's a very simple example, where you use the mapper activity once to set a process variable (here the filename) and then use it in two different following activities (create CSV File and Write File). Obviously, the mapper activity becomes more interesting if the mapping is not as trivial as here (though even in this simple example, you only have one place to change how the filename is generated rather than two):
Mapper Activity
First use of the filename variable in Create File
Second use of the filename variable in Write File
Process Variable/Assign Activity Vs Mapper Activity
The primary purpose of an assign task is to store a variable at a process level. Any variable in an assign task can be modified N times in a process. But a mapper is specifically used for introducing a new variable. We cannot change the same mapper variable multiple times in a project.
Memory is allocated to Process Variable when the process instance is created but in case of TIBCO Mapper the memory is allocated only when the mapper activity is executed in a process instance.
A Process Variable is allocated a single slot of memory which is used to update/modify the schema throughout the process instance execution, i.e. N assign activities will access the same memory allocated to the variable, whereas using N mappers for the same schema will allocate N amounts of memory.
An Assign activity can be used to accumulate the output of a TIBCO activity inside a group.
