How to solve the relationship failure? - hortonworks-data-platform

I have a processor that appears to be creating FlowFiles correctly (modified a standard processor), but when it goes to commit() the session, an exception is raised:
2016-10-11 12:23:45,700 ERROR [Timer-Driven Process Thread-6] c.s.c.processors.files.GetFileData [GetFileData[id=8f5e644d-591c-4df1-8c79-feea118bd8c0]] Failed to retrieve files due to {}  org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord transfer relationship not specified
I'm assuming this indicates there's no connection available to commit the transfer; however, a "success" relationship is registered during init() in the same way the original processor did it, and its outgoing connection is wired to another processor's input as it should be.
Any suggestions for troubleshooting?

What changes did you make to the standard processor? If you are calling methods on the ProcessSession object, ensure that you are saving the latest "version" of the FlowFile returned from those method calls, and transfer only the latest version to "success".
FlowFile objects are immutable; in code you will often see an initial reference like "flowFile" pointing at the incoming flow file (from session.get(), for example), which is then reassigned as the flow file is mutated, such as flowFile = session.putAttribute(flowFile, "myAttribute", "myValue").
Also ensure that you have transferred or removed the latest version of each distinct flow file (not the various references to the same flow file) to some relationship (even Relationship.SELF if need be). If your processor creates a new flow file, ensure that new flow file is transferred. If the incoming flow file is no longer needed, be sure to call session.remove() on it.
There are some common patterns and additional guidance in the NiFi Developer's Guide, including test patterns; your unit test(s) for this processor should be able to flush out this error (by asserting how many flow files should have been transferred to which relationship(s) during the test).
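The keep-the-latest-handle rule can be sketched with a toy Python simulation. This models only the versioned-handle behavior described above; it is not the real NiFi API (whose checks are stricter), and all names here are invented for illustration:

```python
# Toy model of NiFi's immutable FlowFile handles -- a simulation for
# illustration only, NOT the real org.apache.nifi API.
import itertools

class FlowFileHandlingException(Exception):
    pass

class FlowFile:
    _versions = itertools.count()
    def __init__(self, record_id, attributes):
        self.record_id = record_id            # identity of the underlying record
        self.version = next(FlowFile._versions)
        self.attributes = dict(attributes)

class Session:
    def __init__(self):
        self._latest = {}          # record_id -> newest FlowFile handle
        self._transferred = set()
    def create(self):
        ff = FlowFile(object(), {})
        self._latest[ff.record_id] = ff
        return ff
    def put_attribute(self, ff, key, value):
        # Like session.putAttribute(): returns a NEW handle; the caller
        # must keep using the returned reference from here on.
        newer = FlowFile(ff.record_id, {**ff.attributes, key: value})
        self._latest[ff.record_id] = newer
        return newer
    def transfer(self, ff, relationship):
        self._transferred.add(ff.version)
    def commit(self):
        # Commit fails unless the NEWEST version of every record was transferred.
        for latest in self._latest.values():
            if latest.version not in self._transferred:
                raise FlowFileHandlingException(
                    "transfer relationship not specified")

session = Session()
ff = session.create()
ff = session.put_attribute(ff, "myAttribute", "myValue")  # keep latest handle
session.transfer(ff, "success")
session.commit()                                          # succeeds

buggy = Session()
stale = buggy.create()
buggy.put_attribute(stale, "k", "v")   # BUG: returned handle discarded
buggy.transfer(stale, "success")       # transfers the stale handle
try:
    buggy.commit()
except FlowFileHandlingException as e:
    print("commit failed:", e)         # mirrors the error in the question
```

The buggy session transfers an out-of-date handle, so the newest version of the record is never transferred, which is exactly the situation the exception in the question reports at commit time.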

Related

How to get NiFi processor error into Flowfile attribute?

I have a PutGCSObject processor for which I want to capture the error into a flow file attribute.
As in the picture, when there is an error, the processor routes the flow file to failure with all the pre-existing attributes as-is.
I want the error message to be part of the same flow file as an attribute. How can I achieve that?
There is actually a way to get it.
Here is how I do it:
1: I route all ERROR connections to a main "monitoring process group"
2: Here is my "monitoring process group"
In UpdateAttribute I capture the filename as initial_filename
Then in my next step I query the bulletins
I then parse the output as individual attributes.
After I have the parsed bulletin output, I use a RouteOnAttribute processor to drop all the bulletins I don't need (some of them I have already used and notified on).
Once only my actual ERROR bulletin is left, I use ExecuteStreamCommand to run a Python script built on the nipyapi module to get more info about the error, such as where it sits in my flow, its hierarchy, a description of the processor that failed, and some processor stats. I also have a metadata catalog about each processor/process group with their custodians and business use case.
This data is then posted to sumologic for logging and also I trigger a series of notifications (Slack + PagerDuty hook to create an incident lifecycle).
I hope this helps
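The bulletin-querying step above can be sketched against NiFi's REST API. The /nifi-api/flow/bulletin-board endpoint is real, but the instance URL, the sourceId, and the canned response below are placeholders for illustration:

```python
import json
import urllib.request   # stdlib; the GET below assumes a reachable NiFi

NIFI_API = "http://localhost:8080/nifi-api"   # hypothetical instance URL

def error_bulletins(bulletins, source_id):
    """Keep only ERROR-level bulletins emitted by one component."""
    return [b["bulletin"] for b in bulletins
            if b.get("bulletin", {}).get("level") == "ERROR"
            and b["bulletin"].get("sourceId") == source_id]

def fetch_bulletins(limit=25):
    # Equivalent of querying the bulletin board in the flow above.
    url = f"{NIFI_API}/flow/bulletin-board?limit={limit}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["bulletinBoard"]["bulletins"]

# Canned sample shaped like the REST response, so the filtering can be
# shown without a live instance:
sample = [
    {"bulletin": {"level": "ERROR", "sourceId": "abc", "message": "boom"}},
    {"bulletin": {"level": "WARNING", "sourceId": "abc", "message": "ok"}},
]
print(error_bulletins(sample, "abc"))
```

The filter plays the role of the RouteOnAttribute step: everything except the ERROR bulletin for the processor of interest is dropped before further handling.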
There's no universal way to append error messages as flow file attributes. We also strongly avoid anything like that, because of the potential to bubble up error messages containing sensitive data to users who are not authorized to see those details.

Need explanation on internal working of read_table method in pyarrow.parquet

I stored all the required parquet tables in a Hadoop filesystem, and all these files have a unique path for identification. These paths are pushed into a RabbitMQ queue as JSON and are consumed by the consumer (in CherryPy) for processing. After successful consumption, the first path is sent for reading, and the following paths are read once the preceding reads are done. To read a specific table I am using the following line of code:
data_table = parquet.read_table(path_to_the_file)
Let's say I have five read tasks in the message. The first read is carried out successfully, and before the remaining reads have been performed I manually stop my server. This stop does not send a "message execution successful" acknowledgement to the queue, as there are four remaining read tasks. Once I restart the server, the whole consumption and reading process starts again from the initial stage. And now, when the read_table method is called on the first path, it gets stuck completely.
Digging into the workflow of the read_table method, I found out where it actually gets stuck, but a further explanation of how this method reads a file inside a Hadoop filesystem is required.
path = 'hdfs://173.21.3.116:9000/tempDir/test_dataset.parquet'
data_table = parquet.read_table(path)
Can somebody please give me a picture of the internal implementation that happens after calling this method, so that I can find where the issue actually occurs and a solution to it?

NiFi: How to get the current processor name and process group name from a custom processor (Java)

I'm creating a NiFi custom processor using Java.
One of the requirements is to get the previous processor name and process group name (like a breadcrumb) from Java code.
The previous processor name and process group name are not immediately (nor meant to be) available to processors; can you explain more about your use case? You can perhaps use a SiteToSiteProvenanceReportingTask to send provenance information back to your own NiFi instance (to an Input Port, for example) and find the events that correspond to FlowFiles entering your custom processor; those events should have the source (previous) processor and the destination (your custom) processor.
If instead you code your custom processor as an InvokeScriptedProcessor with Groovy, for example, then you can "bend the rules" and get at the previous processor name and such. Groovy allows access to private members, and you can assume the implementation of the ProcessContext in onTrigger is an instance of StandardProcessContext, so you can get at its members, which include the upstream connections and thus the previous processor. For a particular FlowFile, though, I'm not sure you can use this approach to know which upstream processor it came from.
Alternatively, you could add an UpdateAttribute after each "previous processor" to set attribute(s) with the information about that processor, but that has to be hardcoded and applied to every corresponding part of the flow.
I faced this some time back. I used the InvokeHTTP processor with the nifi-api/process-groups/${process_group_id} web service.
This is how I implemented it:
Identify the process group where the error handling should be done [Action Group].
Create a new process group [Error Handling Group] next to the Action Group and add a relationship to transfer files to the Error Handling Group.
Use the InvokeHTTP processor and set HTTP Method to GET
Set Remote URL to http://{nifi-instance}:{port}/nifi-api/process-groups/${action_group_process_group_id}
You will get a response in JSON, which you will have to customize according to your needs
Please let me know if you need the XML file that I am using; I can share it. It works fine for me.
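A minimal sketch of the GET in step 4 and the JSON handling in step 5, assuming the ProcessGroupEntity shape the endpoint returns; the base URL, group id, and sample response below are placeholders:

```python
import json
import urllib.request   # stdlib stand-in for the InvokeHTTP call

def group_name(entity):
    """Pull the process group's name out of a ProcessGroupEntity dict."""
    return entity["component"]["name"]

def fetch_group(base_url, group_id):
    # Mirrors the InvokeHTTP GET: /nifi-api/process-groups/{id}
    url = f"{base_url}/nifi-api/process-groups/{group_id}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# Canned response in roughly the shape the endpoint returns:
sample = {"id": "1234", "component": {"id": "1234", "name": "Action Group"}}
print(group_name(sample))   # -> Action Group
```

Inside a flow you would do the same customization with EvaluateJsonPath (e.g. $.component.name) on the InvokeHTTP response instead of Python.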

What is the fastest way to remove NiFi flowfile content?

I have a workflow where I am getting JSON files as a response from a REST API. I am getting approximately 100k files in a session; the total size of all the files is 15GB. I have to save each file to the file system, which I am doing. At the end of the process I have to wait for all the files to be present before I send a success message.
Once I save a file in the FS, I call Notify+Wait, but I don't need the 15GB of data in the flow files anymore. So to release some space, I thought of using either ReplaceText or ModifyBytes to clear the content, so that Notify+Wait runs smoothly. The total wait for this process is 3 hrs.
But the process takes too long in both the ReplaceText and ModifyBytes cases.
Can you suggest the fastest way to clear flow file content? I do not need any attributes either, so is there a way I can abandon the old flow file and generate a small flow file midway?
What I want is something like GenerateFlowFile, but in the middle of the flow, so that for each of my existing flow files I can drop the old one and generate a blank flow file for Notify and Wait.
Thanks
NiFi's Content Repository and FlowFile Repository are based on a copy-on-write mechanism, so if you don't change the contents or metadata, then you are not necessarily "keeping" the 15GB across those processors.
Having said that, if all you need is the existence of such flow files on disk (but not contents or metadata), try ExecuteScript with the following Groovy script:
def flowFiles = session.get(1000)
flowFiles.each {
session.transfer(session.create(), REL_SUCCESS)
}
session.remove(flowFiles)
This script will grab up to 1000 flow files at a time, and for each one, send an empty flow file downstream. It then removes all the original incoming flow files.
Note that this (i.e. your use case) will "break" the provenance/lineage chain, so if something goes wrong in your flow, you won't be able to tell which flow files came from which parent flow files, etc. This limitation is one reason why you don't see a full processor that performs this kind of function.
In case you need to keep the attributes, lineage and metadata, you can use the following code (it grabs only one flow file at a time). The only thing that changes is the UUID; otherwise everything is kept, except the content of course.
def f = session.get()
if (f != null) {                                      // nothing queued: do nothing
    session.transfer(session.create(f), REL_SUCCESS)  // empty child keeps f's attributes
    session.remove(f)
}

Apache NiFi instance hangs on the "Computing FlowFile lineage..." window

My Apache NiFi instance just hangs on the "Computing FlowFile lineage..." window for a specific flow. Others work, but it won't show the lineage for this specific flow for any data files. The only error message in the log is related to an error in one of the processors, but I can't see how that would affect the lineage, or stop the page from loading.
This was related to two things...
1) I was using the older (but default) provenance repository, which didn't perform well, resulting in the lag in the UI. So I needed to change it in nifi.properties...
#nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository
nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
2) Fixing #1 exposed the second issue, which was that the EnforceOrder processor was generating hundreds of provenance events per file, because I was ordering on a timestamp, which had large gaps between the values. This is apparently not a proper use case for the EnforceOrder processor. So I'll have to remove it and find another way to do the ordering.
