NiFi FetchFile processor doesn't allow dynamic attributes

What is the reason that some NiFi processors don't allow dynamic attributes? I'm using the FetchFile processor in one of my workflows, and I need to pass some data through the flow so that I can use it in the last step. However, FetchFile breaks this by not allowing dynamic attributes. Is there another way to do it? Why would NiFi not allow dynamic attributes on certain processors?
My flow is something like
ExecuteScript -> EvaluateJsonPath -> Custom processor to write files -> FetchFile -> SendToS3 -> Mark workflow complete
I want to send some metadata along so that I can mark the workflow complete. I'm passing that data as attributes, but it breaks at FetchFile.

There are two separate concepts, user-defined properties on processors, and flow file attributes.
User-defined properties let a processor take input from a user for something that couldn't be defined ahead of time. Examples of this are EvaluateJsonPath, where the JSON paths are specified in user-defined properties, and PutSolrContentStream, where all the user-defined properties are passed as query parameters to Solr.
FlowFile attributes are a map of key/value pairs that get passed around with each piece of data. These attributes are usually created when a processor produces or modifies a flow file, or can be manipulated using processors like UpdateAttribute.
It is up to each processor to decide whether it needs user-defined properties and how they would be used. UpdateAttribute happens to be a processor where the user-defined properties are added as new key/value pairs to each flow file, but it doesn't make sense for every processor to do that.
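For context, whether a processor accepts user-defined (dynamic) properties is an explicit opt-in in its code. Here is a rough Java sketch of the hook a processor author overrides (the class name is made up for illustration; the builder calls and validator are from the standard nifi-api):

import org.apache.nifi.components.PropertyDescriptor;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.util.StandardValidators;

// Hypothetical processor, shown only to illustrate the opt-in hook.
public abstract class DynamicPropsExampleProcessor extends AbstractProcessor {

    // Returning a descriptor here is what makes the processor accept
    // user-defined properties; a processor that doesn't override this
    // rejects them, which is why they "break" on processors like FetchFile.
    @Override
    protected PropertyDescriptor getSupportedDynamicPropertyDescriptor(final String propertyDescriptorName) {
        return new PropertyDescriptor.Builder()
                .name(propertyDescriptorName)
                .dynamic(true)
                .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
                .build();
    }
}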

Related

Read flow file attribute/content to processor property

I want to set a property of a processor based on the contents of the last flowfile that came through.
Example: I instantiate the flowfile with a GenerateFlowFile processor, using the custom text ${now()} so that the flowfile content is the timestamp of its creation.
I want some processor (which kind is irrelevant to me) to read the content of the flowfile (the timestamp) into the processor's custom property property_name. Afterwards I want to be able to query the processor via the REST API and read that property from the processor.
Initially I thought I could do that with the ExtractText processor, but it extracts text based on a regex and writes it back to the flowfile, whereas I want to keep that information in the processor until the next flowfile arrives.
You can't do that directly in NiFi: a processor's configuration can't be updated while it is running.
Maybe you can use state variables on UpdateAttribute?
Stateful Usage
By selecting the "store state locally" option for the "Store State" property, UpdateAttribute will not only store the evaluated properties as attributes of the FlowFile but also as stateful variables to be referenced in a recursive fashion. This enables the processor to calculate things like the sum or count of incoming FlowFiles. A dynamic property can be referenced as a stateful variable like so:

Dynamic Property: key: theCount, value: ${getStateValue("theCount"):plus(1)}

This example will keep a count of the total number of FlowFiles that have passed through the processor.

To use logic on top of state, simply use the "Advanced Usage" of UpdateAttribute. All Actions will be stored as stateful attributes as well as being added to FlowFiles. Using the "Advanced Usage" it is possible to keep track of things like a maximum value of the flow so far. This would be done by having a condition of ${getStateValue("maxValue"):lt(${value})} and an action of attribute: "maxValue", value: "${value}".

The "Stateful Variables Initial Value" property is used to initialize the stateful variables and is required to be set if running statefully. Some logic rules will require a very high initial value, like using the Advanced rules to determine the minimum value.

If stateful properties reference other stateful properties, then the value of the other stateful properties will be an iteration behind. For example, attempting to calculate the average of the incoming stream requires the sum and count. If all three properties are set in the same UpdateAttribute (like below), then the average will never include the most recent values of count and sum:

Count: key: theCount, value: ${getStateValue("theCount"):plus(1)}
Sum: key: theSum, value: ${getStateValue("theSum"):plus(${flowfileValue})}
Average: key: theAverage, value: ${getStateValue("theSum"):divide(getStateValue("theCount"))}

Instead, since the average relies only on the theCount and theSum attributes (which are added to the FlowFile as well), there should be a following stateless UpdateAttribute which properly calculates the average.

In the event that the processor is unable to get the state at the beginning of onTrigger, the FlowFile will be pushed back to the originating relationship and the processor will yield. If the processor is able to get the state at the beginning of onTrigger but unable to set the state after adding attributes to the FlowFile, the FlowFile will be transferred to "set state fail". This is normally due to the state not being the most up-to-date version (another thread has replaced the state with another version). In most use cases this relationship should loop back to the processor, since the only affected attributes will be overwritten.

Note: Currently the only "stateful" option is to store state locally. This is because the current implementation of clustered state relies on ZooKeeper, and ZooKeeper isn't designed for the type of load/throughput that UpdateAttribute with state would demand. In the future, if/when multiple different clustered state options are added, UpdateAttribute will be updated.
Thanks to @Ivan I was able to create a fully working solution; for future reference:
Instantiate flowfiles with e.g. a GenerateFlowFile processor and add a custom property "myproperty" with the value ${now()} (note: you can add this property to the flowfiles in any processor, it doesn't have to be a GenerateFlowFile processor).
Have an UpdateAttribute processor with the Store State property (under processor properties) set to Store state locally.
Add a custom property in the UpdateAttribute processor with the name readable_property and set it to the value ${'myproperty'}.
The state of the processor now contains the value of the last flowfile (e.g. a timestamp of when the attribute was added to the flowfile).
Added Bonus:
Get the value of the stateful processor (and hence the value of the last flowfile that passed through!) via the REST API with a GET on the URI /nifi-api/processors/{id}/state
The JSON which gets returned contains the following lines:
{
  "key": "readable_property",
  "value": "Wed Apr 14 11:13:40 CEST 2021",
  "clusterNodeId": "some-id-0d8eb6052",
  "clusterNodeAddress": "some-host:port-number"
}
Then you just have to parse the JSON for the value.
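If it helps, here is a minimal sketch of that GET-and-parse step from outside NiFi, using Java 11's java.net.http client and Jackson; the host, port, and processor id are placeholders, and the componentState/localState/state path into the response matches the fragment above but should be verified against your NiFi version:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ReadProcessorState {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port; pass the processor id as the first argument.
        String uri = "http://localhost:8080/nifi-api/processors/" + args[0] + "/state";

        HttpResponse<String> resp = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(uri)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        // State entries are assumed to live under componentState.localState.state
        // as {"key": ..., "value": ...} objects, as in the fragment above.
        JsonNode entries = new ObjectMapper().readTree(resp.body())
                .path("componentState").path("localState").path("state");
        for (JsonNode entry : entries) {
            if ("readable_property".equals(entry.path("key").asText())) {
                System.out.println(entry.path("value").asText());
            }
        }
    }
}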
You should use the UpdateAttribute processor.
You can read about several methods, e.g. Update attributes based on content in NiFi.

NiFi Flow for Record Enrichment

I am using NiFi 1.11.4 to build a data pipeline where an IoT device sends data in JSON format. Each time I receive data from the IoT device, I receive two JSONs:
JSON_INITIAL
{
  "devId": "abc",
  "devValue": "TWOINITIALCHARS23"
}
and
JSON_FINAL
{
  "devId": "abc",
  "devValue": "TWOINITIALCHARS45"
}
There is a time difference of a few milliseconds between these two flow files. In my use case, I need to merge the JSONs in such a way that the resulting JSON looks like below (note the removal of TWOINITIALCHARS in both cases):
JSON_RESULT_AFTER_MERGE
{
  "devId": "abc",
  "devValue": "2345"
}
Is this something NiFi should be dealing with? If yes, would really appreciate an approach to design relevant flow for this use case.
Assuming the devId is static for a device and not used for the correlation (i.e. abc for all messages coming from this device, not abc for the first two and then def for the next two, etc.), you have a few options:
Use MergeContent to concatenate the flowfile contents (the two JSON blocks) and ReplaceText to modify the combined content to match the desired output. This will require tuning the MergeContent binning properties to limit the merge window to 1-2 seconds (difficult/insufficient if you're receiving multiple messages per second, for example) and using regular expressions to remove the duplicate content.
Use a custom script to interact with the device JSON output (Groovy, for example, will make the JSON interaction pretty simple); a standalone sketch of the merge logic is shown after this list.
If you do this within the context of NiFi (via ExecuteScript or InvokeScriptedProcessor), you will have access to the NiFi framework, so you can evaluate flowfile attributes and content, making this easier (there will be attributes for initial timestamp, etc.).
If you do this outside the context of NiFi (via ExecuteProcess or ExecuteStreamCommand), you won't have access to the NiFi framework (attributes, etc.) but you may have better interaction with the device directly.
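To make the scripted option concrete, below is a minimal standalone Java sketch of just the merge logic, assuming Jackson is on the classpath and that the two payloads arrive as strings; the field names and the TWOINITIALCHARS prefix come from the question, while the class and method names are made up:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class DeviceJsonMerger {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Strips the fixed "TWOINITIALCHARS" prefix from both devValue fields and
    // concatenates the remainders, e.g. "...23" + "...45" -> "2345".
    public static String merge(String initialJson, String finalJson) throws Exception {
        JsonNode initial = MAPPER.readTree(initialJson);
        JsonNode last = MAPPER.readTree(finalJson);

        String left = initial.get("devValue").asText().replaceFirst("^TWOINITIALCHARS", "");
        String right = last.get("devValue").asText().replaceFirst("^TWOINITIALCHARS", "");

        ObjectNode result = MAPPER.createObjectNode();
        result.put("devId", initial.get("devId").asText());
        result.put("devValue", left + right);
        return MAPPER.writeValueAsString(result);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(merge(
                "{\"devId\":\"abc\",\"devValue\":\"TWOINITIALCHARS23\"}",
                "{\"devId\":\"abc\",\"devValue\":\"TWOINITIALCHARS45\"}"));
        // prints {"devId":"abc","devValue":"2345"}
    }
}

Inside NiFi, the same logic would sit in the body of an ExecuteScript or InvokeScriptedProcessor reading two correlated flowfiles.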

NiFi: How to get the current processor name and process group name in a custom processor (Java)

I'm creating a NiFi custom processor in Java.
One of the requirements is to get the previous processor name and process group name (like a breadcrumb) from Java code.
The previous processor name and process group name are not immediately (nor meant to be) available to processors; can you explain more about your use case? You can perhaps use a SiteToSiteProvenanceReportingTask to send provenance information back to your own NiFi instance (e.g. to an Input Port) and find the events that correspond to FlowFiles entering your custom processor; those events should have the source (previous) processor and destination (your custom) processor.
If instead you code your custom processor using InvokeScriptedProcessor with Groovy, for example, then you can "bend the rules" and get at the previous processor name and such: Groovy allows access to private members, and you can assume the implementation of the ProcessContext in onTrigger is an instance of StandardProcessContext, so you can get at its members, which include the upstream connections and thus the previous processor. For a particular FlowFile, though, I'm not sure you can use this approach to know which upstream processor it came from.
Alternatively, you could add an UpdateAttribute after each "previous processor" to set attribute(s) with the information about that processor, but that has to be hardcoded and applied to every corresponding part of the flow.
I faced this some time back. I used the InvokeHTTP processor with the nifi-api/process-groups/${process_group_id} web service.
This is how I implemented:
Identify the process group where the error handling should be done. [Action Group]
Create a new process group [Error Handling Group] next to the Action Group and add a relationship to transfer files to the Error Handling Group.
Use the InvokeHTTP processor and set HTTP Method to GET
Set Remote URL to http://{nifi-instance}:{port}/nifi-api/process-groups/${action_group_process_group_id}
You will get a response in JSON, which you will have to customize according to your needs.
Please let me know if you need the XML file that I am using; I can share it. It works fine for me.
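For reference, the JSON that comes back is a process-group entity; the group's name typically sits under component.name (a trimmed excerpt of the assumed shape is below; verify the exact layout against your NiFi version), so an EvaluateJsonPath property of $.component.name can pull the group name into an attribute:

{
  "id": "action-group-id",
  "component": {
    "id": "action-group-id",
    "name": "Action Group"
  }
}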

In NiFi, does usage of the EvaluateJsonPath processor have a performance impact because of attribute creation?

I'm trying to integrate the NiFi REST APIs with my application: by mapping inputs and outputs from my application, I call the NiFi REST API to create the flow. In my use case, most of the time I will extract JSON values and apply expression language functions to them.
So, to simplify all the use cases, I am using the EvaluateJsonPath processor to fetch all attributes using JSONPath and then applying expression language functions to them in an extract processor.
Is this the right approach? For JSON-to-JSON manipulation with about 30 keys this is the simplest way, and since I am integrating the NiFi REST APIs with my application I cannot generate JOLT transformation logic dynamically from the user mapping.
So, in this case, does using the EvaluateJsonPath processor create performance issues for about 50 use cases with different transformation logic? The documentation says heavy attribute usage creates performance (memory) issues.
Your concern about having too many attributes in memory should not be an issue here; having 30 attributes per flowfile is higher than usual, but if these are all strings of roughly 100-200 characters or fewer, there should be minimal impact. If you start trying to extract KBs worth of data from the flowfile content into attributes on each flowfile, you will see increased heap usage, but the framework should still be able to handle this until you reach very high throughput (thousands of flowfiles per second on commodity hardware like a modern laptop).
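As a rough back-of-envelope illustration (my numbers, not from the NiFi docs): 30 attributes of ~100 characters each is about 30 x 100 x 2 bytes = 6 KB of heap per flowfile for the values (Java strings use roughly 2 bytes per character), so even 10,000 flowfiles resident in queues would account for only about 60 MB, small relative to a typical multi-GB heap.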
You may want to investigate ReplaceTextWithMapping, as that processor can load from a definition file and handle many replace operations using a single processor.
It is usually a flow design "smell" to have multiple copies of the same flow process with different configuration values (with the occasional exception of database interaction). Rather, see if there is a way you can genericize the process and populate the relevant values for each flowfile using variable population (from the incoming flowfile attributes, the variable registry, environment variables, etc.).

NiFi: Get all the processor names involved in a particular run

I have a NiFi template of 30 processors. There are multiple conditional branches in the template. Now I want to add something at the end of the template so that I can get a list of all the processor names that executed for a particular run.
How can I do this?
Thanks,
You could technically insert an UpdateAttribute processor after every "operational" processor to add an attribute containing the most recent processor name, but @Bryan is correct that the provenance feature exists to provide this information automatically. If you need to operate on it, you can use the SiteToSiteProvenanceReportingTask to send that data to a Remote Process Group (linked to an Input Port on the same instance) and then treat that data like any other in NiFi and examine/transform it.
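If you go the reporting-task route, the provenance records arrive as JSON; below is a minimal Java sketch of examining them outside NiFi, assuming Jackson is on the classpath and assuming the task's usual field names (entityId for the flowfile UUID, componentName for the processor), both of which should be verified against your NiFi version:

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class RunBreadcrumb {

    // Collects, in event order, the names of the components that emitted a
    // provenance event for the given flowfile UUID.
    public static Set<String> processorsForFlowFile(Path eventsJson, String flowFileUuid) throws Exception {
        JsonNode events = new ObjectMapper().readTree(Files.readString(eventsJson));
        Set<String> names = new LinkedHashSet<>();
        for (JsonNode event : events) {
            // entityId and componentName are assumed field names; adjust them
            // if your reporting task emits a different schema.
            if (flowFileUuid.equals(event.path("entityId").asText())) {
                names.add(event.path("componentName").asText());
            }
        }
        return names;
    }
}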
