I am new to Apache NiFi. I ran into a problem where the MergeContent processor has huge chunks of files queued. I really don't know how to take those files out of the queue. There are no errors, and the files are getting queued for no apparent reason.
Even if we stop the processor and list the queue, it says the queue has no FlowFiles.
Below are screenshots of the issue. I would really appreciate some help resolving this.
Connection Details
QUEUE ERROR
What I would suggest is checking what the FlowFiles contain and whether there is data that NiFi can actually merge together. If there isn't any mergeable data, clear the queues and work out how to get the data into a state where it can be merged.
What this means is that if FlowFile A has fields A, B, and D, and FlowFile B has fields F, J, and L, see if you can get common fields together so they can be merged.
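Not NiFi-specific, but a tiny Python illustration of that idea (the field names are hypothetical): project both records onto the union of their fields so they share one schema before merging.
record_a = {'A': 1, 'B': 2, 'D': 3}
record_b = {'F': 4, 'J': 5, 'L': 6}

# Build one common schema from the union of the fields.
common_schema = sorted(set(record_a) | set(record_b))

def normalize(record, schema):
    # Fill missing fields with None so every record has the same columns.
    return {field: record.get(field) for field in schema}

merged = [normalize(r, common_schema) for r in (record_a, record_b)]
print(merged)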
Related
I stored all the required Parquet tables in a Hadoop filesystem, and each of these files has a unique path for identification. These paths are pushed into a RabbitMQ queue as JSON and are consumed by the consumer (in CherryPy) for processing. After successful consumption, the first path is sent for reading, and the following paths are read once the preceding reads are done. To read a specific table I am using the following line of code,
data_table = parquet.read_table(path_to_the_file)
Let's say I have five read tasks in the message. The first read is carried out and completes successfully, and before the remaining reads are performed I manually stop my server. This stop does not send a "message executed successfully" acknowledgement to the queue, as there are four remaining reads. Once I restart the server, the whole consumption and reading process starts again from the initial stage. And now, when the read_table method is called on the first path, it gets stuck completely.
Digging into the workflow of the read_table method, I found out where it actually gets stuck.
But a further explanation of how this method reads a file from a Hadoop filesystem is needed.
import pyarrow.parquet as parquet  # assuming the pyarrow Parquet module is what is being used

path = 'hdfs://173.21.3.116:9000/tempDir/test_dataset.parquet'
data_table = parquet.read_table(path)
Can somebody please give me a picture of the internal implementation that happens after calling this method, so that I can find where the issue actually occurs and a solution to it?
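Not an authoritative walk-through of the pyarrow internals, but a minimal sketch of the steps read_table effectively performs for an hdfs:// path (resolve the filesystem from the URI, open the file, read the Parquet footer, then read the row groups), assuming pyarrow with libhdfs/CLASSPATH configured and the namenode address above being reachable. Running it step by step at least narrows down whether it is the HDFS connection, the footer read, or the row-group read that blocks.
import pyarrow.fs as pafs
import pyarrow.parquet as pq

path = 'hdfs://173.21.3.116:9000/tempDir/test_dataset.parquet'

# Step 1: resolve the filesystem and the in-filesystem path from the URI.
hdfs, relative_path = pafs.FileSystem.from_uri(path)

# Step 2: open the file and read the Parquet footer (schema and row-group offsets).
with hdfs.open_input_file(relative_path) as f:
    pf = pq.ParquetFile(f)
    print(pf.metadata)

    # Step 3: read the row groups into an Arrow Table.
    data_table = pf.read()

print(data_table.num_rows)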
I have "unmatched" flowfiles in a queue. Is there any way to transfer these flowfiles into another queue?
EDIT: Resolved with @Andy's suggested solution.
There isn't a way to directly transfer between queues, because that would take away the meaning of how those flow files got into the queue; they have to pass through the previous processor, which makes the decision about which queue to place them in. You can create loops using a processor that does nothing, such as UpdateAttribute, and then connect that back to the original processor.
Bryan's answer is comprehensive and explains the ideal process for ongoing success. If this is a one-time task ("I have this queue that contains data I was using during testing; now I want it to go to this other processor"), you can simply select the queue containing the data and drag the blue endpoint to the other component.
The flow files are stuck in the queue (load balanced by attribute) and are not read by the next downstream processor (MergeRecord with CSVReader and CSVRecordSetWriter). From the NiFi UI it appears that flow files are in the queue, but when I try to list the queue it says "Queue has no flow files". Attempting to empty the queue gives the same message. The NiFi logs don't contain any exceptions related to the processor. There are around 80 flow files in the queue.
I have tried the action items below, all in vain:
Restarted the downstream and upstream (ConvertRecord) processors.
Disabled and re-enabled the CSVReader and CSVRecordSetWriter.
Disabled load balancing.
Set flow file expiration to 3 sec.
Screenshot:
Flowfile:
MergeRecord properties:
CSVReader Service:
CSVRecordSetWriter:
Your MergeRecord processor is running only on the primary node, and likely all the files are on other nodes (since you are load balancing). NiFi is not aware enough to notice that the downstream processor is only running on the primary node, so it does not automatically rebalance everything to the primary. Simply changing MergeRecord to run on all nodes will allow the files to pass through.
Alas, I have not found a way to get all flow files back onto the primary node; you can use the "Single Node" load-balance strategy to get all the files onto the same node, but it will not necessarily be the primary.
This is probably because the content of the FlowFile was deleted; however, its entry is still present in the FlowFile repository.
If you have a dockerized NiFi setup and you don't have a heavy production flow, you can stop your NiFi flow and delete everything in the *_repository folders (flowfile_repository, content_repository, provenance_repository, etc.),
provided you have all your directories mounted and no other data loss is at risk.
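A minimal sketch of that cleanup, assuming NiFi is fully stopped, the default repository directory names, and a hypothetical install path (adjust both to your docker volumes). Deleting these wipes all queued FlowFiles, content, and provenance, so only do it where that data loss is acceptable.
import shutil
from pathlib import Path

NIFI_HOME = Path('/opt/nifi/nifi-current')  # hypothetical install/mount location

# Default repository directory names from nifi.properties.
for repo in ('flowfile_repository', 'content_repository', 'provenance_repository'):
    repo_path = NIFI_HOME / repo
    if repo_path.exists():
        shutil.rmtree(repo_path)   # drop all stored FlowFile/content/provenance data
        repo_path.mkdir()          # recreate the empty directory for the next start
        print(f'cleared {repo_path}')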
Let me know if you need further assistance
You have a misconfiguration in the way you load balance your FlowFiles. To check that, stop your MergeRecord processor so you can list and view what's inside your queue.
In the modal window displayed, you can check on which node your FlowFiles are waiting; it's highly probable that your FlowFiles are in fact on one of the other nodes, but since MergeRecord is running on the primary node, it has nothing in its queue.
My Apache NiFi instance just hangs on the "Computing FlowFile lineage..." for a specific flow. Others work, but it won't show the lineage for this specific flow for any data files. The only error message in the log is related to an error in one of the processors, but I can't see how that would affect the lineage, or stop the page from loading.
This was related to two things...
1) I was using the older (but default) provenance repository, which didn't perform well, resulting in the lag in the UI. So I needed to change it...
#nifi.provenance.repository.implementation=org.apache.nifi.provenance.PersistentProvenanceRepository
nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
2) Fixing #1 exposed the second issue, which was that the EnforceOrder processor was generating hundreds of provenance events per file, because I was ordering on a timestamp, which had large gaps between the values. This is apparently not a proper use case for the EnforceOrder processor. So I'll have to remove it and find another way to do the ordering.
I'll try to explain this the best I can.
As I store the data I receive from my ActiveMQ queue in several distinct locations, I have decided to build a composite queue so I can process the data for each location individually.
The issue I am running into is that the queue is currently in a production environment. It seems that changing a queue named A into a composite queue, also called A, with virtual destinations B and C causes me to lose all the data on the existing queue; on start-up it does not forward the previous messages. Currently, I am creating a new composite queue with a different name, say D, which forwards data to B and C. Then I have some clunky code that blocks all connections until I have both a) updated all the producers to send to D and b) pulled the data from A with a consumer and sent it to D with a producer.
It feels rather messy. Is there any way around this? Ideally I would keep the same queue name, have all of its current data sent to the composite sub-queues, and have the queue only forward from then on.
From the description given, the desired behavior is not possible: message routing on the composite queue applies while messages are in flight, not later, when that queue has already stored messages and the broker configuration is changed. You need to consume the past messages from the initial queue (A, I guess) and send them on to the desired destinations.
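As a rough illustration of that drain-and-resend step (not part of the original answer), here is a minimal sketch assuming ActiveMQ's STOMP connector on port 61613, stomp.py 8.x, and placeholder queue names and credentials.
import time
import stomp

class ForwardingListener(stomp.ConnectionListener):
    # Re-sends each message from the old queue A to the new composite queue D,
    # acknowledging the original only after the forward has been sent.
    def __init__(self, conn):
        self.conn = conn

    def on_message(self, frame):
        self.conn.send(destination='/queue/D', body=frame.body)
        self.conn.ack(frame.headers['message-id'], frame.headers['subscription'])

conn = stomp.Connection([('localhost', 61613)])
conn.set_listener('forwarder', ForwardingListener(conn))
conn.connect('admin', 'admin', wait=True)

# client-individual ack keeps each message on A until we explicitly ack it.
conn.subscribe(destination='/queue/A', id='drain-A', ack='client-individual')

time.sleep(30)  # let the drain run; in practice stop once A's queue size reaches 0
conn.disconnect()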