How to control retries of the InvokeHTTP processor in NiFi

For example, with the InvokeHTTP processor, if the retry relationship is connected back to the processor itself, how can I control the number of retries when a 500-series error occurs?
I want to cap the number of retries at some count, e.g. 5, and each retry must happen at a specific time: the first retry after 1 minute, the second after 30 minutes, the third after 24 hours, and so on.

There is an open Jira case (NIFI-90) discussing automatic penalization and back-off. In the meantime, if you want to implement your own back-off, you would send your retry relationship to a flow that would eventually be routed back to the InvokeHttp processor (or dropped once the retry count reached the max). The back-off cycle could look like this:
InvokeHttp -[retry]-> UpdateAttribute -> RouteOnAttribute -[give up]-> (Drop)
    ^                                           |
    |                                           v
    |------------------------------------------ (Delay)
UpdateAttribute: Sets/increments a "counter/retry" attribute and/or a correlated "delay amount" attribute.
RouteOnAttribute: Checks the counter to see if the max number of retries (5, e.g.) has been reached, and sends the flow file to (Drop) if so; otherwise the flow file continues on. The (Drop) processor can be an UpdateAttribute or something that auto-terminates its outgoing relationship, or some error-handling/reporting logic.
(Delay): This could be an ExecuteScript processor that delays the transfer of a flow file based on the current retry count and/or the delay amount. Alternatively you could use ControlRate, but you'd need to invert the "delay amount" to set the attribute for ControlRate accordingly (using the previous UpdateAttribute to increase the Rate Controlled Attribute's value).
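The decision made by the UpdateAttribute/RouteOnAttribute/(Delay) steps above can be sketched in plain Python. This is only an illustration of the logic, not NiFi code: the names DELAYS_SECONDS, MAX_RETRIES and next_step are made up for this example, the schedule (1 min, 30 min, 24 h) and the limit of 5 come from the question, and reusing the last delay for the remaining retries is an assumption.

```python
from typing import Optional

# Hypothetical schedule from the question: 1 min, 30 min, 24 h (in seconds).
DELAYS_SECONDS = [60, 30 * 60, 24 * 60 * 60]
# Maximum number of retries from the question.
MAX_RETRIES = 5


def next_step(retry_count: int) -> Optional[int]:
    """Return the delay in seconds before the next retry, or None to give up.

    retry_count is the number of retries already made, i.e. the value of
    the counter attribute before it is incremented.
    """
    if retry_count >= MAX_RETRIES:
        return None  # corresponds to RouteOnAttribute -[give up]-> (Drop)
    # Assumption: once the schedule is exhausted, keep using the last delay.
    index = min(retry_count, len(DELAYS_SECONDS) - 1)
    return DELAYS_SECONDS[index]


if __name__ == "__main__":
    for count in range(MAX_RETRIES + 1):
        print(count, next_step(count))
```

In a real flow the counter and the delay would live in flow file attributes, and the delay itself would be applied by whatever you choose for the (Delay) step.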

As of NiFi 1.10, there is a built-in processor named RetryFlowFile. FlowFiles passed to this processor have a ‘Retry Attribute’ value checked against a configured Maximum Retries value. If the current attribute value is below the configured maximum, the FlowFile is passed to a retry relationship. If the FlowFile’s attribute value exceeds the configured maximum, the FlowFile will be passed to a retries_exceeded relationship.

I really liked this solution, suggested by Alessio Palma (scroll to find his response). It looks less messy to me. I do wish processors had a retry/timeout option, or even something global at the process group level.

Related

Proper way to trigger a flow retry?

Consider this flow:
It's a simple flow to authenticate to an HTTP API and handle success/failure. In the failure state, you can see I added a ControlRate processor and that there are 2 FlowFiles in the queue for it. I have it set to only pass one FlowFile every 30 seconds (Time Duration = 30 sec, Maximum Rate = 1). So the queue will continue to fill in the meantime if authentication keeps failing.
What I want is to essentially drop all but the first FlowFile in this queue, because I don't want it to continue re-triggering the authentication processor after we get a successful authentication.
I believe I can accomplish this by setting the FlowFile Expiration (on the highlighted queue) to be just longer than the 30 second Time Duration of the ControlRate processor. But this seems a bit arbitrary and not quite correct in my mind.
Is there a way to say "take first, drop rest" for the highlighted queue?

Apache NiFi Wait select FlowFile by attribute

I am creating a flow for processing data from multiple sources (same platform, different customers). Each FlowFile is generated by triggering the HandleHttpRequest processor. I can only process one file at a time for a given customer. The process is also asynchronous (I loop until the API responds that the process has finished).
What I have right now is a Wait/Notify flow, so after one FlowFile is processed, Wait releases another file to process. However, this only works for one customer. What I want is either a dynamic number of Wait processors or a single Wait processor that can release FlowFiles conditionally (by attribute).
Example:
I have customer A and B. Each has generated FlowFiles with attribute
customer: ${cust_name}
These FlowFiles have been stopped in the Wait processor and are waiting for a notification from the Notify processor. The order of these files is unknown (though the files for any one customer are always sorted). This means that the queue can look like this (A3 B3 A2 A1 B2 B1). What I want is to notify the Wait processor to release the next A element or B element by attribute.
Is something like this possible?
I found the solution to what I wanted to achieve!
So I have a Wait processor accepting files with a customer attribute, which has a value of either A or B.
The files then loop through the Wait processor via the wait relationship.
The problem is that the order in which these files enter the wait queue is always the same, and the Wait processor only ever looks at the first entry in the queue.
To achieve the perpetual cycling of FlowFiles, you need to configure the wait queue with FirstInFirstOutPrioritizer.
However, this does not guarantee that the Wait processor will release the oldest FlowFile, because the wait queue is always changing.
But there is a solution for this. There is a Wait Penalty Duration property, which will skip the first file in the queue if it did not match the signal, then the second, the third, and so on, until the desired oldest file is found (or the penalty expires). You can find the whole conversation here: https://github.com/apache/nifi/pull/3540
It works with Run Schedule set to 0 and the wait queue at its default settings.

Introduce time delay before moving flow files to next processor in NiFi

In NiFi, there is a data flow that consumes from MQTT (ConsumeMQTT) and publishes to an HDFS path (PutHDFS). I have a requirement to introduce a 60-minute delay before pushing the consumed data to the HDFS path. I found the ControlRate and MergeContent processors as possible solutions, but I'm not sure.
What is the ideal solution to introduce time delay?
Example: A flow file consumed at 9:00 AM should be published into HDFS at 10:00 AM
You can use an ExecuteScript processor to run a sleep(60*60*1000) loop, but this would unnecessarily use system resources.
I would instead introduce a RouteOnAttribute processor which has an output relationship of one_hour_elapsed going to PutHDFS, and unmatched looped back to itself. The RouteOnAttribute processor should have Routing Strategy set to Route to Property Name and a dynamic property (click the + button on the top right of the Properties tab) named one_hour_elapsed. The Expression Language value should be ${now():toNumber():gt(${entryDate:toNumber():plus(3600000)})}.
This expression:
Gets the current time and converts it to milliseconds since the epoch (now():toNumber())
Gets the entryDate attribute of the flowfile (when it entered NiFi) and converts it to milliseconds and adds one hour (entryDate:toNumber():plus(3600000) [3600000 == 60*60*1000])
Compares the two numbers (a:gt(${b}))
If this is not actually the start of your flow, you can use an UpdateAttribute processor to insert an arbitrary timestamp at any point of your flow and calculate from there.
I would also recommend setting the Yield Duration and Run Schedule of the RouteOnAttribute processor to be substantially higher than usual, as you do not want this processor to run constantly as it will do no work. I'd suggest setting this to 1 or 5 minutes to start, as you are introducing a one hour delay already.
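To make the comparison in the expression above concrete, here is a small Python sketch of the same arithmetic. It is not NiFi code; the function names and the example date are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

ONE_HOUR_MS = 60 * 60 * 1000  # 3600000, the constant used in the expression


def to_epoch_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the epoch."""
    return int(moment.timestamp() * 1000)


def one_hour_elapsed(now_ms: int, entry_date_ms: int) -> bool:
    """Same check as now > entryDate + 3600000 (strictly greater, like gt)."""
    return now_ms > entry_date_ms + ONE_HOUR_MS


if __name__ == "__main__":
    # Arbitrary example: a flow file that entered at 9:00 AM UTC.
    entered = datetime(2020, 1, 1, 9, 0, tzinfo=timezone.utc)
    for minutes in (30, 60, 61):
        now = entered + timedelta(minutes=minutes)
        print(minutes, one_hour_elapsed(to_epoch_ms(now), to_epoch_ms(entered)))
```

Because gt is a strict comparison, a flow file is released on the first run of the processor after the hour has passed, so the actual delay also depends on the Run Schedule.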
Starting with NiFi 1.10, this can be done even more easily with the RetryFlowFile processor.
Use the penalty duration to set the delay time.

Nifi - Process the files based on count or time elapsed?

I have the following flow:
ListFile ---> FetchFile ---> ? ExecuteScript (maybe) ---> Notify
Basically, I want to go to Notify, if
Total flowfiles (from fetch files) is say 200; OR
Time elapsed (from last signal) is say 3 hours.
I think the first condition is easy to achieve: I can have a Groovy script that reads the number of flowfiles and, if it is 200, routes to SUCCESS, or else rolls back the session.
But I also want to know how to check whether the time elapsed for n flowfiles in the queue (n can be less than 200) is more than 3 hours or so.
Update
Here is the problem: We currently have a batch process (~200 files, which may grow with the business in the future). We have a NiFi pipeline (List, Fetch, basic validation on checksum, etc., then process by calling the SQL), which is working fine.
As per the business, corrections to the data can arrive throughout the day, so we may get all or some of the files to "re-process". That is also fine and working.
Now, as per new requirements, we need to run a further process after this "batch" is completed. In the best case, I can use the MergeContent processor with a max bin of n and then signal or notify my new processor.
However, as explained above, throughout the day a few or all of the files may be processed again, so my "n" may not match the number of files re-processed. Hence, if say 3 hours have elapsed, I should notify the new process to run again even if the number of re-processed files does not equal "n".
Hence, I am looking for n files OR m hours elapsed check.
I think this may be an example of an XY problem -- you're trying to solve a problem and believe that counting the number of files fetched or time elapsed will help, but this pattern is usually discouraged in Apache NiFi and there are other solutions to the original problem. I would encourage you to describe more fully the higher level problem you are trying to solve to see if there is a better solution.
I will answer the question though (none of these are ideal solutions).
You can use a MergeContent processor with a minimum bin count of 200
You can use an ExecuteScript processor as you noted
You can write a value (the current timestamp) to a DistributedCacheMapServer when the Notify processor executes, and check that value with a FetchDistributedCacheMap processor against the current timestamp and use a simple Expression Language statement to compare the timestamp values
I think you may also want to read some examples of Wait/Notify logic, because creating thresholds like "200 incoming flowfiles || 3 hours elapsed time" is what the Wait processor does.
"How to wait for all fragments to be processed, then do something?" by Koji Kawamura
"NiFi workflow monitoring – Wait/Notify pattern with split and merge" by Pierre Villard
"Simple NiFi Wait/Notify Example" answer by Abdelkrim Hadjidj
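The "n files OR m hours elapsed" condition from the question can be written down as a tiny check. This is a plain Python sketch of the comparison only (for example, what a script or an Expression Language test against the cached timestamp would compute); the names should_notify, MAX_FILES and MAX_WAIT_SECONDS are hypothetical, and the values 200 and 3 hours come from the question.

```python
import time
from typing import Optional

MAX_FILES = 200                 # "n" from the question
MAX_WAIT_SECONDS = 3 * 60 * 60  # "m hours" from the question (3 hours)


def should_notify(file_count: int,
                  last_signal_epoch: float,
                  now_epoch: Optional[float] = None) -> bool:
    """Return True when either threshold is reached.

    last_signal_epoch is the time (seconds since the epoch) of the last
    notification, e.g. the timestamp stored in the cache.
    """
    if now_epoch is None:
        now_epoch = time.time()
    enough_files = file_count >= MAX_FILES
    waited_long_enough = (now_epoch - last_signal_epoch) >= MAX_WAIT_SECONDS
    return enough_files or waited_long_enough
```

For example, should_notify(50, last_signal_epoch=0, now_epoch=4 * 3600) is true because more than 3 hours have passed, even though only 50 files arrived.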

ElasticSearch gives error about queue size

RemoteTransportException[[Death][inet[/172.18.0.9:9300]][bulk/shard]]; nested: EsRejectedExecutionException[rejected execution (queue capacity 50) on org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1#12ae9af];
Does this mean I'm doing too many operations in one bulk at one time, or too many bulks in a row, or what? Is there a setting I should be increasing or something I should be doing differently?
One thread suggests "I think you need to increase your 'threadpool.bulk.queue_size' (and possibly 'threadpool.index.queue_size') setting due to recent defaults." However, I don't want to arbitrarily increase a setting without understanding the fault.
I lack the reputation to reply to the comment as a comment.
It's not exactly the number of bulk requests made, it is actually the total number of shards that will be updated on a given node by the bulk calls. This means the contents of the actual bulk operations inside the bulk request actually matter. For instance, if you have a single node, with a single index, running on an 8 core box, with 60 shards and you issue a bulk request that has indexing operations that affects all 60 shards, you will get this error message with a single bulk request.
If anyone wants to change this, you can see the splitting happening inside of org.elasticsearch.action.bulk.TransportBulkAction.executeBulk() near the comment "go over all the request and create a ShardId". The individual requests happen a few lines down around line 293 on version 1.2.1.
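The arithmetic behind that example can be sketched as follows. This is a simplified model, not Elasticsearch code: it assumes the bulk pool has one thread per core, that the queue capacity is the 50 from the error message, and that nothing else is in flight on the node. The function names are made up for illustration.

```python
from typing import Iterable


def shard_level_requests(target_shards: Iterable[int]) -> int:
    """One bulk request is split into one request per distinct shard it touches."""
    return len(set(target_shards))


def would_reject(target_shards: Iterable[int],
                 bulk_threads: int,
                 queue_capacity: int) -> bool:
    """Simplified model: requests beyond threads + queue slots are rejected."""
    return shard_level_requests(target_shards) > bulk_threads + queue_capacity


if __name__ == "__main__":
    # A single bulk request whose operations touch all 60 shards,
    # on an 8 core box with a bulk queue capacity of 50.
    touched = range(60)
    print(shard_level_requests(touched), would_reject(touched, 8, 50))
```

Under these assumptions 8 threads plus 50 queue slots can hold 58 shard-level requests, so a single bulk request touching 60 shards is already enough to trigger the rejection.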
You want to up the number of bulk threads available in the thread pool. ES sets aside threads in several named pools for use on various tasks. These pools have a few settings; type, size, and queue size.
from the docs:
The queue_size allows to control the size of the queue of pending
requests that have no threads to execute them. By default, it is set
to -1 which means its unbounded. When a request comes in and the queue
is full, it will abort the request.
To me that means you have more bulk requests queued up waiting for a thread from the pool to execute one of them than your current queue size. The documentation seems to indicate the queue size is defaulted to both -1 (the text above says that) and 50 (the call out for bulk in the doc says that). You could take a look at the source to be sure for your version of es OR set the higher number and see if your bulk issues simply go away.
ES thread pool settings doco
Elasticsearch 1.3.4
Our system: 8 cores * 2
4 bulk workers, each inserting 300,000 messages per minute => 20,000 per second
I also hit that exception, then set this config:
elasticsearch.yml
threadpool.bulk.type: fixed
threadpool.bulk.size: 8 # availableProcessors
threadpool.bulk.queue_size: 500
Source:
BulkRequestBuilder bulkRequest = es.getClient().prepareBulk();
bulkRequest.setReplicationType(ReplicationType.ASYNC).setConsistencyLevel(WriteConsistencyLevel.ONE);
// "documents" (name assumed here) is the collection of JSON strings to index
for (String document : documents) {
    bulkRequest.add(es.getClient().prepareIndex(esIndexName, esTypeName).setSource(document.getBytes("UTF-8")));
}
// Send all queued index operations as one bulk request and wait for the result
BulkResponse bulkResponse = bulkRequest.execute().actionGet();
With 4 cores, set bulk.size to 4.
After that, no errors.
I was having this issue and my solution ended up being increasing ulimit -Sn and ulimit -Hn for the elasticsearch user. I went from 1024 (default) to 99999 and things cleaned right up.

Resources