Why is Flowfile Being Cloned? - apache-nifi

I was wondering why some processors clone the flow file before routing it to the next processor.
For example, the QueryDatabaseTable processor always clones the flow file before passing it to the ConvertAvroToORC processor.
Why is there a need to clone the FlowFile?

I think the clone event represents the fact that the same flow file is being transferred to two different destinations (LogMessage and ConvertAvroToORC). So there is one flow file created by QueryDatabaseTable; when the framework sees that the success relationship goes to two places, it has to clone the flow file so that each destination receives its own copy.

Related

Apache nifi: Difference between the flowfile State and StateManagement

From what I've read here and there, the flowfile repository serves as a Write-Ahead Log for Apache NiFi.
While walking through the configuration files, I've seen that there is a state-management configuration section. In standalone mode, a local-provider is used and writes the state (by default) to ./state/local/.
It seems that both the flowfile repository and the state are used, for example, to recover from a system failure.
Would someone please explain the difference between them? Do they work together?
Also, it's a best practice to have the flowfile repo and the content repo on two separate disks. What about the local state? Should we avoid using the "boot" disk and offload it to another one? Which one: a dedicated disk? Co-located with another repo (I'm co-locating the database and flowfile repositories)?
Thanks.
The flow file repository keeps track of all the flow files in the system, which content they point to, which attributes they have, and where they are in the flow.
State Management is an API provided to processors/services that can be used to store and retrieve key/value pairs, typically for remembering where something left off. For example, a source processor that pulls data since some timestamp would want to store the last timestamp it used so that if NiFi restarts it can retrieve this value and start from there again.
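To make this concrete, here is a minimal Groovy sketch (written as it could appear in an ExecuteScript processor, where context is the bound ProcessContext) of how a source processor might persist such a watermark through the state management API. The key name "last.timestamp" is purely illustrative, not a NiFi-defined key:
import org.apache.nifi.components.state.Scope

// read the previously stored state (CLUSTER scope; on a standalone node
// this falls back to the local provider discussed above)
def stateManager = context.stateManager
def oldState = stateManager.getState(Scope.CLUSTER)
def lastTimestamp = oldState.get('last.timestamp') ?: '0'

// ... pull only the records newer than lastTimestamp ...

// store the new watermark so the next run (or a restart) picks up here
def newState = new HashMap<String, String>(oldState.toMap())
newState['last.timestamp'] = String.valueOf(System.currentTimeMillis())
stateManager.setState(newState, Scope.CLUSTER)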

What is the fastest way to remove NiFi flowfile content?

I have a workflow where I get JSON files as the response of a REST API. I receive approximately 100k files in a session; the total size of all the files is 15 GB. I have to save each file to the file system, which I am doing. At the end of the process I have to wait for all the files to be present before I send a success message.
Once I save a file to the FS, I call Notify + Wait, but I don't need the 15 GB of data in the flow files anymore. So to release some space, I thought of using either ReplaceText or ModifyBytes to clear the content, so that Notify + Wait runs smoothly. The total wait for this process is 3 hours.
But the process takes too long in both cases (ReplaceText and ModifyBytes).
Can you suggest the fastest way to clear flow file data? I do not need any attributes either, so is there a way I can abandon the old flow file and generate a KB-sized flow file midway?
What I want is something like GenerateFlowFile, but in the middle of the flow: for each of my existing flow files, drop the old one and generate a blank flow file for Notify and Wait.
Thanks
NiFi's Content Repository and FlowFile Repository are based on a copy-on-write mechanism, so if you don't change the contents or metadata, then you are not necessarily "keeping" the 15GB across those processors.
Having said that, if all you need is the existence of such flow files on disk (but not contents or metadata), try ExecuteScript with the following Groovy script:
def flowFiles = session.get(1000) // grab up to 1000 flow files at a time
flowFiles.each {
    // send a brand-new, empty flow file downstream in place of each original
    session.transfer(session.create(), REL_SUCCESS)
}
session.remove(flowFiles) // drop the originals and their content
This script will grab up to 1000 flow files at a time, and for each one, send an empty flow file downstream. It then removes all the original incoming flow files.
Note that this (i.e. your use case) will "break" the provenance/lineage chain, so if something goes wrong in your flow, you won't be able to tell which flow files came from which parent flow files, etc. This limitation is one reason why you don't see a full processor that performs this kind of function.
In case you need to keep the attributes, lineage, and metadata, you can use the following code (it grabs only one flow file at a time). The only thing that changes is the UUID; otherwise everything is kept except the content, of course.
def flowFile = session.get()
if (flowFile == null) return // nothing to process
// the child inherits attributes and lineage, but not the content
session.transfer(session.create(flowFile), REL_SUCCESS)
session.remove(flowFile)

NiFi: Get all the processor names involved in a particular run

I have a NiFi template of 30 processors, with multiple conditional branches in the template. Now I want to add something at the end of the template so that I can get a list of all the processor names that executed for a particular run.
How can I do this?
Thanks,
You could technically insert an UpdateAttribute processor after every "operational" processor to add an attribute containing the most recent processor name, but @Bryan is correct that the provenance feature exists to provide this information automatically. If you need to operate on it, you can use the SiteToSiteProvenanceReportingTask to send that data to a Remote Process Group (linked to an Input Port on the same instance) and then treat that data like any other in NiFi and examine/transform it.
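As a rough sketch of that last step, the following ExecuteScript (Groovy) body reads a batch of provenance events delivered by the SiteToSiteProvenanceReportingTask and reduces it to the distinct component names involved. This is a hedged example: the field name componentName matches the reporting task's JSON output as I know it, but verify it against the events your NiFi version actually emits:
import groovy.json.JsonOutput
import groovy.json.JsonSlurper
import org.apache.nifi.processor.io.InputStreamCallback
import org.apache.nifi.processor.io.OutputStreamCallback

def flowFile = session.get()
if (flowFile == null) return

// the reporting task delivers a JSON array of provenance events
def events = null
session.read(flowFile, { inputStream ->
    events = new JsonSlurper().parse(inputStream)
} as InputStreamCallback)

// keep the distinct processor names that emitted events
def names = events.collect { it.componentName }.unique()

flowFile = session.write(flowFile, { outputStream ->
    outputStream.write(JsonOutput.toJson(names).bytes)
} as OutputStreamCallback)
session.transfer(flowFile, REL_SUCCESS)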

Multiple flows with NiFi

We have multiple (50+) NiFi flows that all do basically the same thing: pull some data out of a DB, append some columns, convert to Parquet, and upload to HDFS. They differ only in details such as the SQL query to run or the location in HDFS where they land.
The question is how to factor these common NiFi flows out such that any change made to the common flow automatically applies to all derived flows. E.g. if I want to add an extra step to also publish the data to Kafka, I want to make this change once and have it automatically apply to all 50 flows.
We've tried to get this working with NiFi Registry; however, it seems like an imperfect fit. Essentially the issue is that NiFi Registry seems to work well for updating a flow in one environment (say UAT) and then automatically updating it in another environment (say prod). It seems less suited for updating multiple flows in the same environment, one specific example being that it will reset the name of each flow to the template name every time we redeploy, meaning that all flows end up with the same name!
Does anyone know how one is supposed to manage a situation like ours, as I guess it must be pretty common?
Apache NiFi has Process Groups. As the name suggests, a process group is there to group together a set of processors and their pipeline that performs a similar task.
So for your case, you can refactor the flow by moving the common part, which can be reused by different pipelines, into a separate process group with an input port. Connect each outside flow that depends on this reusable flow to the input port of the reusable process group. Depending on your requirements, you can create an output port as well in this process group and connect it back to the outside flow.
Attaching a sample:
For the sake of explanation, I have made a mock flow, so ignore the processor types used and look instead at the names I have given them.
The following screenshots show that I read from two different sources and individually connect each to a processor that makes the source-specific changes.
Then I connect these two flows to the input port of a process group that has the reusable flow inside. So ultimately the two different flows shown in the above screenshot share a common reusable flow.
Showing what's inside the reusable flow:
Finally, the output port output to outside connects the reusable flow to the outside component Write to somewhere.
I hope this helps you with refactoring your complex flows. Feel free to get back if you have any queries.

How to solve the relationship failure?

I have a processor (I modified a standard processor) that appears to be creating FlowFiles correctly, but when it goes to commit() the session, an exception is raised:
2016-10-11 12:23:45,700 ERROR [Timer-Driven Process Thread-6] c.s.c.processors.files.GetFileData [GetFileData[id=8f5e644d-591c-4df1-8c79-feea118bd8c0]] Failed to retrieve files due to {}  org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord transfer relationship not specified
I'm assuming this indicates there's no connection available to commit the transfer; however, there is a "success" relationship registered during init() in the same way the original processor did it, and the success relationship out is connected to another processor's input as it should be.
Any suggestions for troubleshooting?
What changes did you make to the standard processor? If you are calling methods on the ProcessSession object, ensure that you are saving the latest "version" of the FlowFile returned from those method calls, and transfer only the latest version to "success".
FlowFile references are immutable; often in code you will see an initial reference like "flowFile" pointing at the incoming flow file (from session.get() for example), then it gets updated as the flow file is mutated, such as flowFile = session.putAttribute(flowFile, "myAttribute", "myValue").
Also ensure that you have transferred or removed the latest version of each distinct flow file (not the various references to the same flow file) to some relationship (even Relationship.SELF if need be). If your processor creates a new flow file, ensure that new flow file is transferred. If the incoming flow file is no longer needed, be sure to call session.remove() on it.
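As a hedged Groovy sketch of that pattern (the same ProcessSession calls exist in Java), note how the flow file reference is reassigned after every mutating call and how every distinct flow file ends up transferred or removed:
def flowFile = session.get()
if (flowFile == null) return

// putAttribute returns a NEW FlowFile reference; transferring a stale,
// older reference is what raises FlowFileHandlingException
flowFile = session.putAttribute(flowFile, 'myAttribute', 'myValue')

// a child created from the incoming flow file must also be accounted for
def child = session.create(flowFile)
session.transfer(child, REL_SUCCESS)

// the incoming flow file still needs a destination: transfer or remove it
session.remove(flowFile)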
There are some common patterns and additional guidance in the NiFi Developer's Guide, including test patterns; your unit test(s) for this processor should be able to flush out this error (by asserting how many flow files should have been transferred to which relationship(s) during the test).
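For example, a minimal test sketch using the nifi-mock TestRunner (GetFileData being the processor class from your log output, and "success" the relationship name you registered) would surface an untransferred or stale flow file immediately:
import org.apache.nifi.util.TestRunner
import org.apache.nifi.util.TestRunners

def runner = TestRunners.newTestRunner(GetFileData.class)
runner.enqueue('some input content'.bytes)
runner.run()

// fails if any flow file was left untransferred, or if a stale reference
// was transferred instead of the latest version
runner.assertAllFlowFilesTransferred('success', 1)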
