I have a situation involving custom processors participating in various stages of the ETL pipeline:
Processor_1 -> Processor_2 -> Processor_3
For some use cases, Processor_3 needs to refer to the original data input to Processor_1. There is the concept of parent reference in the flowfile definition.
The crude method is to pass the original data along with the flowfile from Processor_1 to Processor_2 to Processor_3. Is there a mechanism for downstream processors to reference the original data?
Neel, what does the object structure of your content look like? Could you create an original and a modified version within the flowfile contents that you could access if needed, and then downstream extract the version you need?
Alternatively, you could stuff the original contents into an attribute and reference them, if needed, in Processor_3 (a rough sketch of this is below).
There might be a NiFi way of doing this, but I'm not sure what that would be (maybe write to the NiFi cache?).
--S
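For what it's worth, a minimal ExecuteScript (Jython) sketch of the "stuff the original contents in an attribute" idea follows. The attribute name original.content is made up for illustration, and this only makes sense for small payloads, since attribute values are held in memory and recorded in provenance; for larger content the NiFi cache (DistributedMapCacheServer/Client) is a better fit.

# ExecuteScript (Jython) sketch: copy the incoming flowfile content into an
# attribute so a downstream processor can still see the original data.
from org.apache.nifi.processor.io import InputStreamCallback
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets

class ReadContent(InputStreamCallback):
    def __init__(self):
        self.text = None
    def process(self, inputStream):
        # read the full content as UTF-8 text
        self.text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)

flowFile = session.get()
if flowFile is not None:
    reader = ReadContent()
    session.read(flowFile, reader)
    # 'original.content' is an example attribute name, not a NiFi built-in
    flowFile = session.putAttribute(flowFile, 'original.content', reader.text)
    session.transfer(flowFile, REL_SUCCESS)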
I inherited a large NiFi flow from a predecessor who was enamored with the ExecuteScript processor. As I review the scripts, they set custom attributes that are (sometimes) used later in the flow. I've been laboriously examining each processor to track where these attributes are used. However, this is quite labor intensive, and I'm hoping there is a better way to determine which attributes are really used. I realize this won't help if the custom attributes are used by other ExecuteScript processors, but it would reduce the search space if I could get a list of attributes matched to the processors that use them.
I've tried using the REST API. If I know the UUID of each processor, I can programmatically get the JSON configuration and parse the attributes out of that. However, I don't see an API call that will list all the processors and their UUIDs.
Any suggestions are much appreciated.
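One option, sketched below in Python, is to walk the flow recursively over NiFi's standard REST endpoints: /process-groups/{id}/processors lists the processors directly inside a group (name, UUID, and configured properties), and /process-groups/{id}/process-groups lists the child groups. The base URL and the unsecured setup are assumptions; adjust for your environment.

# Sketch: dump every processor's UUID, name, and configured properties by
# recursively walking process groups through the NiFi REST API.
import requests

BASE = "http://localhost:8080/nifi-api"  # placeholder; add auth for a secured instance

def walk(pg_id="root"):
    # processors directly inside this process group
    procs = requests.get(f"{BASE}/process-groups/{pg_id}/processors").json()
    for proc in procs["processors"]:
        comp = proc["component"]
        yield comp["id"], comp["name"], comp["config"]["properties"]
    # recurse into child process groups
    groups = requests.get(f"{BASE}/process-groups/{pg_id}/process-groups").json()
    for child in groups["processGroups"]:
        yield from walk(child["component"]["id"])

for uuid, name, props in walk():
    print(uuid, name, props)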
I have a processor that generates time series data in JSON format. Based on the received data, I need to make a forecast using machine learning algorithms in Python, and then write the new forecast values to another flowfile.
The problem is that when such a Python script starts, it must perform a lot of heavy preprocessing: queries to a database, creating a complex data structure, initializing forecasting models, etc.
If you use ExecuteStreamCommand, the script will be run anew for every flowfile. Is this true?
Can I write a Python script in NiFi that starts once and then receives flowfiles many times, storing the history of previously received data? Or do I need to make an HTTP service that will receive data from NiFi?
You have a few options:
1. Build a custom processor. This is my suggested approach. The code would need to be in Java (or Groovy, which provides a more Python-like experience) but would not have the Python dependencies, etc. I have seen examples of this approach used to apply ML models (see Tim Spann's examples), and it is generally very effective. The initialization and per-flowfile trigger logic are cleanly separated, and performance is good.
2. Use InvokeScriptedProcessor. This allows you to write the code in Python and separate the initialization (pre-processing, DB connections, etc., onScheduled in NiFi processor parlance) from the execution phase (onTrigger); a rough skeleton is sketched after this list. Some examples exist, but I have not personally pursued this with Python specifically. You can use Python dependencies but not "native modules" (i.e. compiled C code), as the execution engine is still Jython.
3. Use ExecuteStreamCommand. Not strongly recommended. As you mention, every invocation would require the preprocessing steps to occur, unless you designed your external application to run a long-lived "server" component, with each ESC invocation sending data to it and returning an individual response. I don't know what your existing Python application looks like, but this would likely involve complicated changes. Tim has another example that uses CDSW to host and deploy the model, with NiFi sending it data via HTTP for evaluation.
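For option 2, here is a rough Jython skeleton (modeled on the commonly shared InvokeScriptedProcessor template; the model-building and JSON handling are placeholders, not tested code) showing how the expensive setup can run once rather than once per flowfile:

# InvokeScriptedProcessor (Jython) skeleton: heavy setup runs once, onTrigger
# only applies the already-initialized model to each incoming flowfile.
from org.apache.nifi.processor import Processor, Relationship

class ForecastProcessor(Processor):
    REL_SUCCESS = Relationship.Builder().name("success").description("Forecast produced").build()

    def __init__(self):
        self.model = None

    def initialize(self, context):
        pass

    def getRelationships(self):
        return set([self.REL_SUCCESS])

    def validate(self, context):
        return []

    def getPropertyDescriptor(self, name):
        return None

    def getPropertyDescriptors(self):
        return []

    def onPropertyModified(self, descriptor, oldValue, newValue):
        pass

    def getIdentifier(self):
        return None

    def buildModel(self):
        # Placeholder for the expensive, one-time setup: database queries,
        # building data structures, initializing the forecasting model, etc.
        return {}

    def onTrigger(self, context, sessionFactory):
        session = sessionFactory.createSession()
        try:
            flowFile = session.get()
            if flowFile is None:
                return
            if self.model is None:
                # lazy one-time initialization; it survives across flowfiles
                # because the scripted processor instance is kept alive
                self.model = self.buildModel()
            # ... read the JSON time series, score it with self.model, and
            # write the forecast back to the flowfile (StreamCallback omitted) ...
            session.transfer(flowFile, self.REL_SUCCESS)
            session.commit()
        except:
            session.rollback(True)
            raise

processor = ForecastProcessor()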
Make a custom processor that can do that; Java is more appropriate here. I believe you can do pretty much everything with Java, you just need to find the right libraries. Yes, there may be some initialization and preprocessing involved, but all of that can be handled in the processor's init method in NiFi, which lets you preserve the state of certain components.
In my use case, I had to build a custom processor that takes in images and counts the number of people in each image. For that, I loaded a deep learning model once in the init method, and the onTrigger method then reuses that model reference every time it processes an image.
When creating Apache NiFi controller services, I'm interested in hearing about when it makes sense to create new ones and when to re-share existing ones.
Currently I have a CsvReader and CSVRecordSetWriter at the root process group and they are reused heavily in child process groups. I have tried to set them up to be as dynamic and flexible as possible to cover the widest number of use cases possible. I am setting the Schema Text property in each currently like this:
Reader Schema Text: ${avro.schema:notNull():ifElse(${avro.schema}, ${avro.schema.reader})}
Writer Schema Text: ${avro.schema:notNull():ifElse(${avro.schema}, ${avro.schema.writer})}
A very common pattern I have is to map files with different fields from different sources into a common format (common schema). So one thought is to use the ConvertRecord or UpdateRecord processors with the avro.schema.reader and avro.schema.writer attributes set to the input and output schemas. Then I would have the writer always set the avro.schema attribute, so any time I read records again further along in a flow it would default to using avro.schema. It feels dirty to leave the reader and writer schema attributes hanging around. Is there a better way from an architecture standpoint? Why have tons of controller services hanging around at different levels? Aside from some settings that may need to be different for some use cases, am I missing anything?
I'm also curious to hear how others organize their schemas. I don't have a need to reuse them in disparate locations across different processor blocks, or to reference different versions, so it seems like a waste to centralize them or maintain a schema registry server (which would also require upgrades and maintenance) when I can just use AvroSchemaRegistry.
In the end, I decided it made more sense to split the controller service into two: one for conversions from Schema A to Schema B, and another that uses the same avro.schema property that normal/default readers and writers use when adding new ones. This allows the right pattern to be chosen explicitly at processor block configuration time, rather than relying on the implicit configuration of a single one. Plus you get the added benefit of not stopping all flows (just a subset) when you only need to tweak settings for one of those two patterns.
My current understanding is that NiFi processor properties are specific to that processor. So a new property added to a processor will only be visible within that processor and will not be passed on to later processor blocks?
This is why UpdateAttribute is necessary to add metadata that stays with the flowfile as it traverses the data flow.
So what is the value in allowing the user to add custom properties in a processor beyond the ones defined and required for that processor to execute? Is it analogous to creating variables that can then be used in other properties?
A very good question and one that comes to everyone's mind when they start working on building data-flows in NiFi.
First things first: Properties vs FlowFile Attributes
As you mentioned in your question, Properties are used to control the behavior of your processor, while Attributes are metadata of the flow in action.
A simple example: let's take the GetFile processor. The properties it exposes, like Input Directory, File Filter, etc., tell your processor where and how to look for the source data. When the processor successfully finds a source matching your configuration, it initiates the flow, meaning a FlowFile is generated. This FlowFile carries the content of the source data plus some metadata about the source, such as the name of the file, the size of the file, the last modified time, etc. This metadata can help you further down the flow in subsequent processors, for example to check the file's type and route the FlowFile accordingly. And mind you, the metadata is not fixed; it differs between processors.
There are a few core attributes that almost every FlowFile carries, like filename, filesize, uuid, path, etc.
What is the purpose of letting users add custom properties when they are not added to the attributes?
It is a feature that NiFi offers to processors, which they can use or ignore. Not all processors allow custom properties to be added; only certain processors do.
Let's take InvokeHTTP as an example. This processor allows the developer to create custom properties. When a user adds a new custom property, that property is added as a header to the HTTP call the processor makes, because the processor is built that way: it looks for any dynamic (custom) properties and, if any are present, treats them as custom header(s) the user wants to send.
At least in this processor's context, it doesn't make sense to capture this header data as metadata, because it may not be useful to subsequent processors. But certain other processors act differently when custom properties are provided, like UpdateAttribute, whose sole purpose is to add every custom property as an attribute to the incoming FlowFile.
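To make that concrete, here is a small hypothetical ExecuteScript (Jython) body for a downstream processor that reads an attribute a previous UpdateAttribute could have set; the attribute names are invented for illustration:

# ExecuteScript (Jython): read an attribute set upstream (e.g. by
# UpdateAttribute) and add one of our own before passing the flowfile on.
flowFile = session.get()
if flowFile is not None:
    # 'source.system' is an example attribute name, not a NiFi built-in
    source = flowFile.getAttribute('source.system')
    log.info('Processing flowfile from source: {}'.format(source))
    flowFile = session.putAttribute(flowFile, 'seen.by.script', 'true')
    session.transfer(flowFile, REL_SUCCESS)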
I have a use case where I have millions of small files in S3 which need to be processed by Spark. I have two options to reduce the number of tasks:
1. Use Coalesce
2. Extend CombineFileInputFormat
But I'm not clear on the performance implications of the two, or when to use one over the other.
Also, CombineFileInputFormat is an abstract class, which means I need to provide my own implementation. But the Spark API (newAPIHadoopRDD) takes the class name as a parameter, and I'm not sure how to pass a configurable maxSplitSize.
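On the configurable maxSplitSize question: one route (sketched below in PySpark, assuming Hadoop's concrete CombineTextInputFormat subclass is acceptable) is to pass the split size through the Hadoop configuration rather than through the class itself:

# PySpark sketch: pack many small files into fewer splits by reading through
# CombineTextInputFormat; the maximum split size comes from the Hadoop conf.
from pyspark import SparkContext

sc = SparkContext(appName="combine-small-files")
rdd = sc.newAPIHadoopFile(
    "s3a://my-bucket/small-files/",  # placeholder path
    "org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"mapreduce.input.fileinputformat.split.maxsize": str(128 * 1024 * 1024)})
lines = rdd.map(lambda kv: kv[1])
print(lines.getNumPartitions())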
Another great option to consider for such scenarios is SparkContext.wholeTextFiles(), which makes one record for each file, with its name as the key and the content as the value -- see the documentation.
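A minimal PySpark usage sketch of that suggestion (the S3 path and the partition hint are placeholders):

# Each element of the RDD is (file path, whole file content); minPartitions
# is only a hint for how many tasks to create.
from pyspark import SparkContext

sc = SparkContext(appName="whole-text-files")
files = sc.wholeTextFiles("s3a://my-bucket/small-files/", minPartitions=200)
word_counts = files.mapValues(lambda content: len(content.split()))
print(word_counts.take(5))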