I inherited a large NiFi flow from a predecessor who was enamored with the ExecuteScript processor. As I review the scripts, they set custom attributes that are (sometimes) used later in the flow. I've been laboriously examining each processor to track where these attributes are used, but this is quite labor-intensive, and I'm hoping there is a better way to determine which attributes are really used. I realize this won't help if the custom attributes are used by other ExecuteScript processors, but it would reduce the search space if I could get a list of attributes matched to the processors that use them.
I've tried using the REST API. If I know the UUID for each processor, I can programmatically get its JSON configuration and parse the attributes out of that. However, I don't see an API call that will list all the processors and their UUIDs.
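For concreteness, this is the kind of traversal I'm after. The sketch below is only a guess at how it might look in Java: it assumes an unsecured instance at localhost:8080, the /process-groups/{id}/processors and /process-groups/{id}/process-groups endpoints as I read them in the REST API docs, and a Jackson dependency for the JSON.

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Walks the flow from the root process group and prints every processor's
    // UUID and name. Assumes an unsecured NiFi at localhost:8080; a secured
    // instance would need a token or certificate on each request.
    public class ListProcessors {

        private static final String BASE = "http://localhost:8080/nifi-api";
        private static final HttpClient HTTP = HttpClient.newHttpClient();
        private static final ObjectMapper MAPPER = new ObjectMapper();

        public static void main(String[] args) throws Exception {
            walk("root"); // "root" is accepted as an alias for the root group's id
        }

        private static void walk(String groupId) throws Exception {
            // Processors directly inside this group
            for (JsonNode proc : get("/process-groups/" + groupId + "/processors").path("processors")) {
                JsonNode component = proc.path("component");
                System.out.println(component.path("id").asText() + "\t" + component.path("name").asText());
            }
            // Recurse into child process groups
            for (JsonNode child : get("/process-groups/" + groupId + "/process-groups").path("processGroups")) {
                walk(child.path("id").asText());
            }
        }

        private static JsonNode get(String path) throws Exception {
            HttpRequest request = HttpRequest.newBuilder(URI.create(BASE + path)).GET().build();
            HttpResponse<String> response = HTTP.send(request, HttpResponse.BodyHandlers.ofString());
            return MAPPER.readTree(response.body());
        }
    }

From there, each processor entity's component.config.properties map could presumably be scanned for ${...} references to custom attributes.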
Any suggestions are much appreciated.
When creating Apache NiFi controller services, I'm interested in hearing when it makes sense to create new ones and when to reuse existing shared ones.
Currently I have a CSVReader and CSVRecordSetWriter at the root process group, and they are reused heavily in child process groups. I have tried to set them up to be as dynamic and flexible as possible to cover the widest range of use cases. I am currently setting the Schema Text property in each like this:
Reader Schema Text: ${avro.schema:notNull():ifElse(${avro.schema}, ${avro.schema.reader})}
Writer Schema Text: ${avro.schema:notNull():ifElse(${avro.schema}, ${avro.schema.writer})}
A very common pattern I have is mapping files with different fields from different sources into a common format (common schema). So one thought is to use the ConvertRecord or UpdateRecord processors with the avro.schema.reader and avro.schema.writer attributes set to the input and output schemas. The writer would then always set the avro.schema attribute, so any time I read records again further along in the flow it would default to using avro.schema. It feels dirty, though, to leave the reader and writer schema attributes hanging around. Is there a better way from an architecture standpoint? Why have tons of controller services hanging around at different levels? Aside from some settings that may need to differ for certain use cases, am I missing anything?
I'm also curious to hear how others organize their schemas. I don't need to reuse them at disparate locations across different processor blocks or reference different versions, so it seems like a waste to centralize them or maintain a schema registry server that will also require upgrades and maintenance when I can just use AvroSchemaRegistry.
In the end, I decided it made more sense to split the controller service into two: one for conversions from Schema A to Schema B, and another that uses the same avro.schema property that normal/default readers and writers use when you add new ones. This allows for explicitly choosing the right pattern at processor configuration time rather than relying on the implicit configuration of a single service. Plus you get the added benefit of not stopping all flows (just a subset) when you only need to tweak settings for one of the two patterns.
My current understanding is that NiFi processor properties are specific to that processor. So a new property added to a processor will only be visible within that processor and will not be passed on to later processors in the flow?
This is why UpdateAttribute is necessary to add metadata that stays with the FlowFile as it traverses the data flow.
So what is the value in allowing the user to add custom properties to a processor beyond the ones defined and required for that processor to execute? Is it analogous to creating variables that can then be used in other properties?
A very good question, and one that comes to everyone's mind when they start building data flows in NiFi.
First things first: Properties vs FlowFile Attributes
As you mentioned in your question, Properties are used to control the behavior of your processor, while Attributes are metadata of your flow in action.
As a simple example, let's take the GetFile processor. The properties it exposes, like Input Directory, File Filter, etc., tell the processor where and how to look for the source data. When the processor finds a source matching your configuration, it initiates the flow, meaning a FlowFile is generated. This FlowFile carries the content of the source data plus some metadata about the source, such as the name of the file, its size, and its last modified time. That metadata can help you further down the flow in subsequent processors, for example by checking the file's type and routing the FlowFile accordingly. And mind you, the metadata is not fixed; it differs from processor to processor.
There are also a few core attributes that every FlowFile carries, like mime.type, filesize, uuid, and path.
What is the purpose of letting users add custom properties when they are not added as attributes?
It is a feature that NiFi offers to processors, which they can use or ignore. Not all processors allow custom properties to be added; only select processors do.
Let's take InvokeHTTP as an example. This processor allows the user to create custom properties. When a user adds a new custom property, that property is added as a header to the HTTP call the processor makes, because the processor is built that way: it looks for any dynamic (custom) properties and, if they are present, treats them as custom headers the user wants to send.
At least in this processor's context, it doesn't make sense to capture this header data as metadata, because it may not be useful to subsequent processors. But certain other processors act differently when custom properties are provided, like UpdateAttribute, whose sole purpose is to add every custom property as an attribute to the incoming FlowFile.
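For anyone writing their own processors, the hook that makes a processor accept custom properties at all is getSupportedDynamicPropertyDescriptor. Below is only a rough sketch of the idea; the class name, the single success relationship, and the "would send header" logging are illustrative, not InvokeHTTP's actual code.

    import org.apache.nifi.components.PropertyDescriptor;
    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;
    import org.apache.nifi.processor.util.StandardValidators;

    import java.util.Collections;
    import java.util.Map;
    import java.util.Set;

    // Sketch of a processor that opts in to dynamic (custom) properties and decides
    // for itself what they mean -- here, pretending each one is an outgoing header.
    public class DynamicPropertySketch extends AbstractProcessor {

        static final Relationship REL_SUCCESS = new Relationship.Builder().name("success").build();

        @Override
        public Set<Relationship> getRelationships() {
            return Collections.singleton(REL_SUCCESS);
        }

        // Returning a descriptor here (instead of null) is what lets users add
        // properties beyond the processor's fixed property list.
        @Override
        protected PropertyDescriptor getSupportedDynamicPropertyDescriptor(final String name) {
            return new PropertyDescriptor.Builder()
                    .name(name)
                    .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
                    .dynamic(true)
                    .build();
        }

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
            FlowFile flowFile = session.get();
            if (flowFile == null) {
                return;
            }
            // The processor, not the framework, decides what a dynamic property means:
            // it could become a header, an attribute, a query parameter, or be ignored.
            for (Map.Entry<PropertyDescriptor, String> entry : context.getProperties().entrySet()) {
                if (entry.getKey().isDynamic()) {
                    getLogger().info("would send header {} = {}",
                            new Object[]{entry.getKey().getName(), entry.getValue()});
                }
            }
            session.transfer(flowFile, REL_SUCCESS);
        }
    }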
I have a question about NiFi and its capabilities, as well as the appropriate use cases for it.
I've read that NiFi really aims to create a space that allows for flow-based processing. After playing around with NiFi a bit, what I've also come to realize is its capability to model/shape data in a way that is useful for me. Is it fair to say that NiFi can also be used for data modeling?
Thanks!
Data modeling is a bit of an overloaded term, but in the context of your desire to model/shape the data in a way that is useful for you, it sounds like NiFi could be a viable approach. The rest of this answer is written under that assumption.
While NiFi employs dataflow principles and a design closely related to flow-based programming (FBP) as a means, its function is a matter of getting data from point A to point B (and possibly back again). Of course, systems don't inherently talk the same protocols, formats, or schemas, so there needs to be something to shape the data into what the consumer is anticipating from what the producer is supplying. This gets into common enterprise integration patterns (EIP) [1] such as mediation and routing. In a broader sense, though, it is simply getting the data to those that need it (systems, users, etc.) when and how they need it.
Joe Witt, one of the creators of NiFi, gave a great Meetup talk that may be in line with this idea of data shaping in the context of data science; the slides are available [2].
If you have any additional questions, I would point you to the community mailing lists [3], where you can ask them, dig in more, and get a broader perspective.
[1] https://en.wikipedia.org/wiki/Enterprise_Integration_Patterns
[2] http://files.meetup.com/6195792/ApacheNiFi-MD_DataScience_MeetupApr2016.pdf
[3] http://nifi.apache.org/mailing_lists.html
Data modeling might well mean many things to many folks, so I'll be careful with that term here. What is very clear in what you're asking is that Apache NiFi is a great system for molding data into the format, schema, and content you need for your follow-on analytics and processing. NiFi has an extensible model, so you can add processors that do this, use the existing processors in many cases, or even use the ExecuteScript processor to write scripts on the fly that manipulate the data.
I have a situation involving custom processors participating in various stages of the ETL pipeline:
Processor_1 -> Processor_2 -> Processor_3
For some use cases, Processor_3 needs to refer to the original data input to Processor_1. There is the concept of a parent reference in the FlowFile definition.
The crude method is to pass the original data along with the FlowFile from Processor_1 to Processor_2 to Processor_3. Is there a mechanism for downstream processors to reference the original data?
Neel, what does the object structure of your content look like? Could you keep both an original and a modified version within the FlowFile content, then downstream extract the version you need?
Alternatively, you could stuff the original content into an attribute and reference it in Processor_3 if needed.
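If the payload is small, that could look roughly like the helper below, called from Processor_1 before it rewrites the content. The helper class and the original.payload attribute name are just made-up examples; keep in mind attributes live in memory and in the FlowFile repository, so anything large belongs in the content repository or an external cache instead.

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.stream.io.StreamUtils;

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;

    // Copies the incoming content into an attribute so that downstream processors
    // can still see it after Processor_1 and Processor_2 have rewritten the content.
    final class OriginalContentHelper {

        static FlowFile preserveOriginal(final ProcessSession session, final FlowFile flowFile) {
            final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            session.read(flowFile, in -> StreamUtils.copy(in, buffer));
            // Processor_3 can later read the attribute directly, or reference it
            // as ${original.payload} in Expression Language.
            return session.putAttribute(flowFile, "original.payload",
                    new String(buffer.toByteArray(), StandardCharsets.UTF_8));
        }
    }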
There might be a more built-in NiFi way of doing this, but I'm not sure what that would be (maybe write to the NiFi cache?).
--S
I am currently writing an Elasticsearch NiFi processor. Individual inserts/writes to ES are not optimal; batching documents is preferred instead. What would be the optimal approach within a NiFi processor to accumulate documents (FlowFiles) and, once a certain amount has been collected, send them in as a batch? The part I am most concerned about is the case where ES being unavailable (down, network partition, etc.) prevents the batch from succeeding. The primary point of the question is: given that NiFi has content storage for queuing, back-pressure, etc., is there a preferred method for using that to ensure no FlowFiles get lost if the destination is down? Maybe there is another processor I should look at for an example?
I have looked at the Mongo processor, Merge, etc. to try to get an idea of the preferred approach for batching inside a processor, but I can't seem to find anything specific. Any suggestions would be appreciated.
There's a good chance I am overlooking some basic functionality baked into NiFi; I am still fairly new to the platform.
Thanks!
Great question, and a pretty common pattern. This is why we have the concept of a ProcessSession. It allows you to send zero or more things to an external endpoint and only commit once you know they have been ack'd by the recipient. In this sense it offers at-least-once semantics. If the protocol you're using supports two-phase-commit-style semantics, you can get pretty close to the ever-elusive exactly-once semantics. Many of the details of what you're asking about will depend on the destination system's API and behavior.
There are some examples in the Apache codebase that show ways to do this. One way, if you can, is to produce a merged collection of events prior to pushing to the destination system; it depends on its API. I think PutMongo and PutSolr operate this way (though the experts on those would need to weigh in). An example that might be more like what you're looking for is PutSQL, which operates on batches of FlowFiles sent in a single transaction (on the destination DB).
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/PutSQL.java
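Stripped way down, that pattern looks something like the sketch below. The class name, relationship names, batch size of 100, and the sendBulk stub are placeholders, not PutSQL's or any real ES processor's code; the point is the session mechanics.

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Skeleton of a batching put processor: pull up to N FlowFiles from the queue,
    // attempt a single bulk send, and only commit once the destination has ack'd.
    public class PutElasticsearchBatchSketch extends AbstractProcessor {

        static final Relationship REL_SUCCESS = new Relationship.Builder().name("success").build();
        static final Relationship REL_FAILURE = new Relationship.Builder().name("failure").build();

        @Override
        public Set<Relationship> getRelationships() {
            return new HashSet<>(Arrays.asList(REL_SUCCESS, REL_FAILURE));
        }

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
            // Pull up to 100 FlowFiles; anything not pulled simply stays queued.
            final List<FlowFile> batch = session.get(100);
            if (batch.isEmpty()) {
                return;
            }
            try {
                sendBulk(batch, session); // placeholder for the Elasticsearch bulk call
            } catch (Exception e) {
                // ES down, network partition, etc.: roll back so the FlowFiles return
                // to the incoming queue untouched and are retried on a later trigger.
                getLogger().error("Bulk index failed; rolling back {} FlowFiles",
                        new Object[]{batch.size()}, e);
                session.rollback(true); // penalize so we don't retry in a tight loop
                return;
            }
            session.transfer(batch, REL_SUCCESS);
            // Nothing is removed from NiFi's repositories until this commit succeeds,
            // which is what gives you the at-least-once guarantee.
            session.commit();
        }

        private void sendBulk(final List<FlowFile> batch, final ProcessSession session) {
            // Hypothetical: read each FlowFile's content via session.read(...), add it to a
            // bulk request, and throw on any error so the caller rolls the session back.
        }
    }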
I will keep an eye here, but you can get the attention of a larger NiFi group at users@nifi.apache.org.
Thanks
Joe