My current understanding is that NiFi processor properties are specific to that processor, so a new property added to a processor is only visible within that processor and is not passed on to later processor blocks?
This is why UpdateAttribute is necessary to add metadata that stays with the FlowFile as it traverses the data flow.
So what is the value in allowing the user to add custom properties in a processor beyond the ones defined and required for that processor to execute? Is it analogous to creating variables that can then be used in other properties?
A very good question, and one that occurs to everyone when they start building data flows in NiFi.
First things first: Properties vs FlowFile Attributes
As you mentioned in your question, properties are used to control the behavior of a processor, while attributes are metadata that travel with the FlowFile as it moves through the flow.
A simple example: let's take the GetFile processor. The properties it exposes, like Input Directory and File Filter, tell the processor where and how to look for the source data. When the processor finds a source matching your configuration, it initiates the flow, meaning a FlowFile is generated. This FlowFile carries the content of the source data plus some metadata about the source, such as the name of the file, its size, and its last modified time. That metadata can help you further down the flow, for example by letting a subsequent processor check the file's type and route the FlowFile accordingly. And mind you, the metadata is not fixed; it differs from processor to processor.
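For example, a RouteOnAttribute processor placed after GetFile could route on the filename attribute that GetFile adds. A minimal, illustrative configuration (the property name "csv" is arbitrary):

Routing Strategy: Route to Property name
csv: ${filename:endsWith('.csv')}

FlowFiles whose filename ends in .csv are sent to the csv relationship; everything else goes to unmatched.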
There are a few core attributes that almost every processor adds, like application.type, filesize, uuid, path, etc.
What is the purpose of letting users add custom properties when they are not added to the attributes?
It is a feature that NiFi offers to processors, which they can use or ignore. Not all processors allow custom properties to be added; only certain processors do.
Let's take InvokeHTTP as an example. This processor allows the developer to create custom properties. When a user adds a new custom property, that property is added as a header to the HTTP call the processor makes, because the processor is built that way: it looks for any dynamic (custom) properties and, if present, treats them as custom headers the user wants to send.
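As an illustration, an InvokeHTTP configured roughly as below would send the dynamic property as an extra request header (the URL, header name, and attribute are made up for this sketch, and property names can vary slightly between NiFi versions):

HTTP Method: POST
Remote URL: https://example.com/ingest
X-Environment: ${environment}

Here X-Environment is a user-added (dynamic) property, so the processor sends it as the X-Environment header, with its value resolved from the environment attribute or variable at runtime.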
At least in this processor's context, it doesn't make sense to capture this header data as metadata, because it may not be useful to subsequent processors. But certain other processors act differently when custom properties are provided, like UpdateAttribute, whose sole purpose is to add every custom property as an attribute to the incoming FlowFile.
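For instance, an UpdateAttribute configured with two illustrative dynamic properties (the names and values here are arbitrary):

source.system: crm
ingest.time: ${now():format('yyyy-MM-dd HH:mm:ss')}

stamps both values onto every incoming FlowFile as attributes, and those attributes stay with the FlowFile for every downstream processor to use.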
Related
I inherited a large NiFi flow whose predecessor was enamored with the ExecuteScript processor. As I review the scripts, they set custom attributes that are (sometimes) used later in the flow. I've been laboriously examining each processor to track where these attributes are used. However, this is quite labor intensive, and I'm hoping there is a better way to determine which attributes are really used. I realize this won't help if the custom attributes are used by other ExecuteScript processors, but it would reduce the search space if I could get a list of attributes matched to the processors that use them.
I've tried using the REST API. If I know the UUID for each processor, I can programmatically get the JSON configuration and parse out the attributes from that. However, I don't see an API call that will list all the processors and their UUIDs.
Any suggestions are much appreciated.
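For reference, one way to get that list is to walk the process-group hierarchy yourself, starting from the root group, using the /process-groups/{id}/processors and /process-groups/{id}/process-groups endpoints. Below is a minimal Python sketch of that idea; it assumes an unsecured NiFi instance at localhost:8080, so adjust the base URL and add authentication for your environment.

import requests

BASE = "http://localhost:8080/nifi-api"  # assumption: unsecured local NiFi

def walk(group_id="root"):
    # Yield (id, name, properties) for every processor in this group and its children
    procs = requests.get(f"{BASE}/process-groups/{group_id}/processors").json()
    for p in procs.get("processors", []):
        comp = p["component"]
        props = comp.get("config", {}).get("properties", {})
        yield comp["id"], comp["name"], props
    # Recurse into child process groups
    children = requests.get(f"{BASE}/process-groups/{group_id}/process-groups").json()
    for child in children.get("processGroups", []):
        yield from walk(child["id"])

for proc_id, name, props in walk():
    for prop_name, value in props.items():
        # Crude filter: only print properties whose values contain Expression Language
        if value and "${" in str(value):
            print(proc_id, name, prop_name, value)

From there it is straightforward to grep the output for the attribute names that the ExecuteScript processors set.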
When creating Apache NiFi controller services, I'm interested in hearing about when it makes sense to create new ones and when to re-share existing ones.
Currently I have a CSVReader and a CSVRecordSetWriter at the root process group, and they are reused heavily in child process groups. I have tried to set them up to be as dynamic and flexible as possible to cover the widest range of use cases. I am currently setting the Schema Text property in each like this:
Reader Schema Text: ${avro.schema:notNull():ifElse(${avro.schema}, ${avro.schema.reader})}
Writer Schema Text: ${avro.schema:notNull():ifElse(${avro.schema}, ${avro.schema.writer})}
A very common pattern I have is to map files with different fields from different sources into a common format (common schema). So one thought is to use the ConvertRecord or UpdateRecord processors with the avro.schema.reader and avro.schema.writer attributes set to the input and output schemas. Then I would have the writer always set the avro.schema attribute, so any time I read records again further along in the flow it would default to using avro.schema. It feels dirty to leave the reader and writer schema attributes hanging around. Is there a better way from an architecture standpoint? Why have tons of controller services hanging around at different levels? Aside from some settings that may need to differ for certain use cases, am I missing anything?
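To make the pattern concrete, here is a sketch of what a single conversion step might look like under that approach (the schemas and attribute values are invented for illustration):

UpdateAttribute (upstream of the conversion):
avro.schema.reader: {"type":"record","name":"source_a","fields":[{"name":"id","type":"string"},{"name":"amount","type":"string"},{"name":"notes","type":"string"}]}
avro.schema.writer: {"type":"record","name":"common","fields":[{"name":"id","type":"string"},{"name":"amount","type":"string"}]}

ConvertRecord:
Record Reader: the shared CSVReader (Schema Text = the reader expression above)
Record Writer: the shared CSVRecordSetWriter (Schema Text = the writer expression above)

The reader parses the source-specific fields, the writer keeps only the fields in the common schema, and the extra notes field is dropped.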
I'm also curious to hear how others organize their schemas. I don't have a need to reuse them at disparate locations across different processor blocks or to reference different versions, so it seems like a waste to centralize them or to maintain a schema registry server that will also require upgrades and maintenance, when I can just use AvroSchemaRegistry.
In the end, I decided it made more sense to split the controller service into two. One handles conversions from Schema A to Schema B, and the other uses the same avro.schema attribute that normal/default readers and writers use when new ones are added. This allows the right pattern to be chosen explicitly when a processor block is configured, rather than relying on the implicit configuration of a single controller service. Plus you get the added benefit of not stopping all flows (just a subset) when you only need to tweak settings for one of those two patterns.
It seems that, the way the "list" processors work, we can't put them in the middle of a flow. How, then, can attributes be passed to ListHDFS? For example, what if I want to parameterize the Directory property and pass it in at runtime?
You can use Expression Language in the Directory property, but it is evaluated only against variables and system properties, not flow file attributes.
ListHDFS (and the other list processors) is designed to track state and determine which files are new since the previous listing. If you could specify the directory from an incoming flow file, the directory could change at any moment, which would make the previous state meaningless, or the processor would have to track state for N directories, which could grow large and would leave it unclear when a directory is no longer being listed.
It may be helpful to implement another processor that allows dynamic listings but does not store state.
In case someone else runs into this question: GetHDFSFileInfo is what you are looking for.
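As a rough sketch of how that looks (the attribute name is made up, and exact property names may differ slightly between NiFi versions): GetHDFSFileInfo accepts an incoming flow file and evaluates Expression Language against its attributes, so the path can be supplied upstream, for example:

Full path: ${hdfs.dir}

where an upstream processor (UpdateAttribute, or whatever triggers the listing) sets the hdfs.dir attribute to the directory to inspect. Because GetHDFSFileInfo does not keep listing state, it can sit in the middle of a flow.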
I see some processors have both GetXXX and ConsumeXXX variants (GetKafka and ConsumeKafka, for example). When should a processor be named GetXXX rather than ConsumeXXX?
Note: I understand the technical differences between ConsumeKafka and GetKafka. My question is mainly on the naming convention.
There is no real meaning behind the names and technically they could be named anything. The typical convention is to start the processor name with a verb that describes the action being performed, followed by the system/thing being acted on.
"Get" processors are typically processors that have no incoming connection and pull data from some external source, and "Put" processors are typically processors that deliver data to an external system.
When the first Kafka processors were developed using the 0.8 Kafka client, they were called GetKafka and PutKafka. The community then wanted to also support Kafka 0.9 at the same time, so ConsumeKafka and PublishKafka were implemented which better aligned with Kafka's terminology, and also provided another name since they couldn't also be called GetKafka and PutKafka.
I have a situation involving custom processors participating in various stages of the ETL pipeline:
Processor_1 -> Processor_2 -> Processor_3
For some use cases, Processor_3 needs to refer to the original data input to Processor_1. There is the concept of parent reference in the flowfile definition.
The crude method is to pass the original data along with the flow file from Processor_1 to Processor_2 to Processor_3. Is there a mechanism for downstream processors to reference the original data?
Neel, what does the object structure of your content look like? Could you create an original and a modified version within the flowfile contents that you could access if needed, and then extract the version you need downstream?
Alternatively, you could stuff the original contents in an attribute and reference them, if needed, in Processor_3.
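One common way to do that, as a sketch (the attribute name is arbitrary, and attributes are held in memory, so this is only sensible for small content; ExtractText's default buffer and capture-group size limits may also need raising):

ExtractText with one dynamic property:
original.content: (?s)(^.*$)

Each FlowFile that passes through then carries its original body in the original.content attribute, which Processor_3 can read even after Processor_2 has rewritten the content.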
There might be a NiFi-native way of doing this, but I'm not sure what that would be (maybe write to the NiFi cache?).
--S