Proper way to trigger a flow retry? - apache-nifi

Consider this flow:
It's a simple flow to authenticate to an HTTP API and handle success/failure. In the failure state, you can see I added a ControlRate processor and that there are 2 FlowFiles in the queue for it. I have it set to pass only one FlowFile every 30 seconds (Time Duration = 30 sec, Maximum Rate = 1), so the queue will continue to fill if the authentication process keeps failing.
What I want is to essentially drop all but the first FlowFile in this queue, because I don't want it to continue re-triggering the authentication processor after we get a successful authentication.
I believe I can accomplish this by setting the FlowFile Expiration (on the highlighted queue) to be just longer than the 30 second Time Duration of the ControlRate processor. But this seems a bit arbitrary and not quite correct in my mind.
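For reference, the expiration workaround described above would amount to settings along these lines (property names from the stock ControlRate processor and connection configuration; 35 sec is just an illustrative value, slightly longer than the rate window):

    ControlRate
        Rate Control Criteria : flowfile count
        Maximum Rate          : 1
        Time Duration         : 30 sec

    Failure connection (the highlighted queue)
        FlowFile Expiration   : 35 sec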
Is there a way to say "take first, drop rest" for the highlighted queue?

Related

How can I constrain a NiFi processor's responses? (queue) Apache NiFi

When I call an API in NiFi, more than one response comes back, and the content of these responses is the same. If I don't stop the processor, the responses keep coming, so I keep turning the processor on and off quickly. Is there a way to restrict this?
Can I limit it to a fixed number of responses no matter how many requests it would otherwise send? For example, return only 3 responses.
NiFi flows are intended to be always-on streams. If you go to the Scheduling tab of a processor's config, you'll see that, by default, it is scheduled to run continuously (0 ms).
If you don't want this style of streaming behaviour, you need to change the Scheduling of the processor.
You can change it to only schedule the processor every X seconds, or you can change it to run based on a CRON expression.
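For example, either of the following on the Scheduling tab would slow the processor down (values are illustrative; the CRON driven strategy uses Quartz cron syntax):

    Scheduling Strategy : Timer driven
    Run Schedule        : 30 sec          (run at most once every 30 seconds)

    Scheduling Strategy : CRON driven
    Run Schedule        : 0 0/5 * * * ?   (run at the start of every 5th minute)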

Apache NiFi Wait select FlowFile by attribute

I am creating a flow for processing some data from multiple sources (same platform, different customers). Each FlowFile is generated by triggering the HandleHttpRequest processor. I can only process one file at a time for a given customer. The process is also asynchronous (I keep looping until the API responds that the processing has finished).
What I have right now is a Wait/Notify flow, so after one FlowFile gets processed, Wait will release another file to process. However, this will only work for one customer. What I want is to have a dynamic number of Wait processors or one Wait processor, that can release FlowFiles conditionally (by attribute).
Example:
I have customer A and B. Each has generated FlowFiles with attribute
customer: ${cust_name}
These FlowFiles are held in the Wait processor, waiting for a notification from the Notify processor. The overall order of the files is unknown (the files for a single customer are always in order). This means the queue can look like this: (A3 B3 A2 A1 B2 B1). What I want is to notify the Wait processor to release the next A element or B element by attribute.
Is something like this possible?
I found the solution to what I wanted to achieve!
So I have a Wait processor accepting files with an attribute customer, which has a value of either A or B.
The files then loop through the Wait processor via the wait relationship.
What happens is that the order of the files entering the wait queue is always the same, and the Wait processor only ever looks at the first entry in the queue.
To achieve perpetual cycling of FlowFiles, you need to configure the wait queue with the FirstInFirstOutPrioritizer.
However, this alone does not guarantee that the Wait processor will release the oldest FlowFile, because the wait queue is always changing.
But there is a solution for this: the Wait Penalty Duration property, which skips the first file in the queue if it does not match the signal, then the second, the third, and so on, until the desired oldest file is found (or the penalty expires). You can find the whole conversation here: https://github.com/apache/nifi/pull/3540
It works with the Run Schedule set to 0 and the wait queue at default settings.
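Putting that together, a configuration along these lines matches the setup described above (property names from the stock Wait/Notify processors; the penalty value is illustrative):

    Wait
        Release Signal Identifier : ${customer}
        Wait Penalty Duration     : 3 sec
        Run Schedule              : 0 sec
        (wait relationship looped back into the Wait processor; the
         FirstInFirstOutPrioritizer can be set on that queue, as discussed above)

    Notify
        Release Signal Identifier : ${customer}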

Why does Nifi PutParquet processor create so many tasks?

The NiFi PutParquet processor, with a timer-driven Run Schedule of 0 sec and the previous processor stopped, shows ~3000 tasks for the last 5 minutes.
We are on NiFi 1.9.2.
My expectation would be that this processor only creates tasks if data is in the incoming queue for the processor. Is this some misconfiguration or a bug in the implementation?
The processor is annotated with @TriggerWhenEmpty, which lets it execute all the time regardless of whether there is data in the incoming queue. The reason for this is that in a kerberized environment the processor needs a chance to refresh its credentials. It was a common problem with other processors: no data comes in for a long time, say over a weekend, the Kerberos ticket expires during that time, and then when data starts coming in on Monday everything fails.
These empty executions shouldn't have a big impact on the system. When the processor executes and no data is available, it just calls yield and returns. The default yield duration is 1 second, but it is configurable through the UI (the processor's Yield Duration setting).

how to control retries of invoke HTTP processor in nifi

For example, with the InvokeHTTP processor, if the retry relationship is connected back to itself, how do I control the number of retries when a 500-series error occurs?
I want to cap the number of retries at some count, e.g. 5, and have each retry happen after a certain delay: the first retry after 1 min, the second after 30 min, the third after 24 hrs, and so on.
There is an open Jira case (NIFI-90) discussing automatic penalization and back-off. In the meantime, if you want to implement your own back-off, you would send your retry relationship to a flow that would eventually be routed back to the InvokeHttp processor (or dropped once the retry count reached the max). The back-off cycle could look like this:
InvokeHttp -[retry]-> UpdateAttribute -> RouteOnAttribute -[give up]-> (Drop)
    ^                                          |
    |                                          v
    +------------------------------------- (Delay)
UpdateAttribute: Sets/increments a "counter/retry" attribute and/or a correlated "delay amount" attribute.
RouteOnAttribute: Checks the counter to see if the max number of retries (5, for example) has been reached, and sends the flow file to (Drop) if so, otherwise continues on. The (Drop) processor can be an UpdateAttribute or something that auto-terminates its outgoing relationship, or some error-handling/reporting logic.
(Delay): This could be an ExecuteScript processor that delays the transfer of a flow file based on the current retry count and/or the delay amount (a sketch follows below). Alternatively you could use ControlRate, but you'd invert the "delay amount" to set the attribute for ControlRate accordingly (using the previous UpdateAttribute to increase the Rate Controlled Attribute's value).
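As a concrete illustration of the (Delay) step, here is a minimal ExecuteScript (Jython) sketch, not a drop-in implementation. It assumes the preceding UpdateAttribute has set a hypothetical retry.release.at attribute to the epoch-millisecond time before which the flow file should keep waiting (for example via Expression Language such as ${now():toNumber():plus(60000)}):

    # ExecuteScript (Jython) sketch of the (Delay) step described above.
    # Assumes a hypothetical 'retry.release.at' attribute holding the
    # epoch-millisecond time before which the flow file must keep waiting.
    import time

    flowFile = session.get()
    if flowFile is not None:
        releaseAt = flowFile.getAttribute('retry.release.at')
        now = long(time.time() * 1000)
        if releaseAt is None or now >= long(releaseAt):
            # Delay has elapsed: send the flow file onward
            # (route this relationship back toward InvokeHttp).
            session.transfer(flowFile, REL_SUCCESS)
        else:
            # Too early: penalize the flow file and route it to failure,
            # which is looped back to this same processor, so it is
            # re-checked only after the Penalty Duration has passed.
            session.transfer(session.penalize(flowFile), REL_FAILURE)

The retry counter itself can be incremented in the UpdateAttribute step with Expression Language, for example a property named retry.count set to ${retry.count:replaceNull(0):plus(1)} (the attribute names here are assumptions, not anything InvokeHttp sets for you).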
From NiFi 1.10 onward, there is a built-in processor named RetryFlowFile. FlowFiles passed to this processor have a 'Retry Attribute' value checked against a configured Maximum Retries value. If the current attribute value is below the configured maximum, the FlowFile is passed to the retry relationship. If the FlowFile's attribute value exceeds the configured maximum, the FlowFile is passed to the retries_exceeded relationship.
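A minimal RetryFlowFile setup for this case might look like the following (property and relationship names as in the stock processor; the maximum is just an example):

    RetryFlowFile
        Retry Attribute   : flowfile.retries
        Maximum Retries   : 5

    Relationships
        retry             -> back to InvokeHTTP (optionally through a delay)
        retries_exceeded  -> error handling / drop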
I really liked this solution, suggested by Alessio Palma (scroll to find his response). It looks less messy to me. I do wish processors had a retry/timeout option, or even something global at the process group level.

How to get time spent inside loop

I have written a load test for a web application. The test script submits a request to the server via HTTP and then polls the server in a While loop with a small timer, to see when the request has been processed. The problem I am having is that in all the listeners (aggregate graph, table, etc.) JMeter only shows the time each individual request took, not the total time to process the job, i.e. the time from when the initial request is sent until the response containing the expected "complete" message arrives.
How can I add something like "profiling points" which will get data onto the listeners graphs? Or is there another way this is typically handled?
You need a Transaction Controller. Put the elements whose times you want to aggregate under it. The Transaction Controller will then appear in all your listeners, and its load and latency times will be the sums of those of its nested elements.
Note that by default this time includes all processing within the controller's scope, not just the samplers; this can be changed by unchecking "Include duration of timer and pre-post processors in generated sample".
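One way to lay the test plan out so that the whole submit-and-poll cycle is reported as a single entry (element names are illustrative):

    Thread Group
        Transaction Controller      "Submit job and poll until complete"
            HTTP Request            "Submit job"
            While Controller        condition: completion flag not yet set
                Constant Timer      poll interval
                HTTP Request        "Poll job status"
                Regex/JSON Extractor  sets the completion flag read by the While condition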

Resources