How to optimally use the NiFi Wait processor

I am currently creating a flow in which I will be merging the results of 10K HTTP responses. I have a couple of questions (please refer to the image below; I am numbering my questions as per the image).
1) As the queue is becoming too long, is it OK to set "Concurrent Tasks" to 10 for InvokeHTTP? What should drive this? The number of cores on the server?
2) Wait is showing quite a big number. Is this just the number of bytes it is writing, or is it using that much memory? If this is just a write, then I might be OK, but if it is some internal queue, then I may soon run out of memory.
Does it make sense to reduce this number by increasing "Run Schedule" from 0 to, say, 20 sec?
3) What exactly is "Back Pressure Data Size Threshold"? The value is set at 1 GB. Does that mean that if the size of the flow files in the queue exceeds that, NiFi will start dropping them? Or will it somehow stop the upstream processor from processing?

1) Yes, increasing concurrent tasks on InvokeHTTP would probably make sense. I wouldn't jump right to 10, but would test increasing from 1 to 2, 2 to 3, etc. until it seems to be working better. Concurrent tasks is the number of threads that can concurrently execute the processor. The total number of threads for your NiFi instance is defined in the controller settings (top-right menu) under Timer Driven threads; you should set the timer driven threads based on the number of CPUs/cores you have.
2) The stats on the processor are totals for the last 5 minutes, so "In" is the total size of all the flow files that have come into the processor in the last 5 minutes. You can see "Out" is almost the same number, which means almost all the incoming flow files have also been transferred out.
3) Back-pressure stops the upstream processor from executing until the queue drops back below the back-pressure threshold. The data size threshold is saying "when the total size of all flow files in the queue exceeds 1 GB, stop executing the upstream processor so that no more data enters the queue while the downstream processor works on the queue". In the case of a self-loop connection, I think back-pressure won't stop the processor from executing; otherwise it would end up in a deadlock where it can't produce more data but also can't work off the queue. In any case, data is never dropped unless you set flow file expiration on the queue.


Golang: How to tell whether producer or consumer is slower when communicating via buffered channels?

I have an app in Golang with a pipeline in which each component performs some work and then passes its results to another component via a buffered channel. That component performs some work on its input and then passes its results to yet another component via another buffered channel, and so on. For example:
C1 -> C2 -> C3 -> ...
where C1, C2, C3 are components in the pipeline and each "->" is a buffered channel.
In Golang, buffered channels are great because they force a fast producer to slow down to match its downstream consumer (or a fast consumer to slow down to match its upstream producer). So, like an assembly line, my pipeline moves along as fast as the slowest component in that pipeline.
The problem is I want to figure out which component in my pipeline is the slowest one so I can focus on improving that component in order to make the whole pipeline faster.
The way that Golang forces a fast producer or a fast consumer to slow down is by blocking the producer when it tries to send to a buffered channel that is full, or by blocking the consumer when it tries to receive from a channel that is empty. Like this:
outputChan <- result // producer would block here when sending to full channel
input := <- inputChan // consumer would block here when consuming from empty channel
This makes it hard to tell which one, the producer or the consumer, is blocking the most, and thus which is the slowest component in the pipeline, because I cannot tell how long each one is blocked for. The one that is blocked for the most time is the fastest component, and the one that is blocked the least (or not at all) is the slowest component.
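I can add code like this just before the read from or write to the channel to tell whether it would block:
// for producer
if len(outputChan) == cap(outputChan) {
    // channel is full right now, so the send below would block
}
outputChan <- result
// for consumer
if len(inputChan) == 0 {
    // channel is empty right now, so the receive below would block
}
input := <-inputChan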
However, that would only tell me the number of times it would block, not the total amount of time it is blocked. Not to mention the TOCTOU issue: the check reflects a single point in time, and the state could change immediately after the check, rendering it incorrect or misleading.
Anybody who has ever been to a casino knows that it's not the number of times you win or lose that matters; it's the total amount of money you win or lose that really matters. I can lose 10 hands at $10 each (for a total loss of $100) and then win one single hand of $150, and I would still come out ahead.
Likewise, it's not the number of times that a producer or consumer is blocked that's meaningful. It's the total amount of time that a producer or consumer is blocked that determines whether it's the slowest component or not.
But I cannot think of any way to determine the total amount of time that something is blocked when reading from or writing to a buffered channel. Or my google-fu isn't good enough. Does anyone have any bright ideas?
There are several solutions that spring to mind.
1. stopwatch
The least invasive and most obvious is to just note the time, before and after, each read or write. Log it, sum it, report on total I/O delay. Similarly report on elapsed processing time.
2. benchmark
Do a synthetic bench, where you have each stage operate on a million identical inputs, producing a million identical outputs. Or do a "system test" where you wiretap the messages that flowed through production, write them to log files, and replay relevant log messages to each of your various pipeline stages, measuring elapsed times. Due to the replay, there will be no I/O throttling.
3. pub/sub
Re-architect to use a higher overhead comms infrastructure, such as Kafka / 0mq / RabbitMQ. Change the number of nodes participating in stage-1 processing, stage-2, etc. The idea is to overwhelm the stage currently under study, no idle cycles, to measure its transactions / second throughput when saturated.
Alternatively, just distribute each stage to its own node, and measure {user, sys, idle} times, during normal system behavior.
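The stopwatch idea (option 1) can be sketched roughly as follows. This is only a minimal illustration, not a drop-in profiler: the names `stageStats`, `timedSend`, `timedRecv` and `demo` are made up for this example, the channel carries plain `int` values, and the 5 ms sleep stands in for real work in a slow consumer. The wrappers do an ordinary blocking send or receive and add the elapsed time to a per-stage total, which sidesteps the TOCTOU problem of checking `len` and `cap` first.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// stageStats accumulates how long one pipeline stage spent blocked on
// channel operations. The type and its fields are illustrative only.
type stageStats struct {
	mu      sync.Mutex
	blocked time.Duration
}

func (s *stageStats) add(d time.Duration) {
	s.mu.Lock()
	s.blocked += d
	s.mu.Unlock()
}

func (s *stageStats) total() time.Duration {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.blocked
}

// timedSend performs a normal blocking send and records how long it took.
func timedSend(ch chan<- int, v int, s *stageStats) {
	start := time.Now()
	ch <- v
	s.add(time.Since(start))
}

// timedRecv performs a normal blocking receive and records how long it took.
func timedRecv(ch <-chan int, s *stageStats) (int, bool) {
	start := time.Now()
	v, ok := <-ch
	s.add(time.Since(start))
	return v, ok
}

// demo runs a fast producer against a deliberately slow consumer and
// returns the total time each side spent blocked on the channel.
func demo() (producerBlocked, consumerBlocked time.Duration) {
	ch := make(chan int, 2)
	var prod, cons stageStats
	var wg sync.WaitGroup

	wg.Add(1)
	go func() {
		defer wg.Done()
		for {
			if _, ok := timedRecv(ch, &cons); !ok {
				return // channel closed and drained
			}
			time.Sleep(5 * time.Millisecond) // simulated slow work
		}
	}()

	for i := 0; i < 20; i++ {
		timedSend(ch, i, &prod) // no work between sends: a fast producer
	}
	close(ch)
	wg.Wait()
	return prod.total(), cons.total()
}

func main() {
	p, c := demo()
	fmt.Printf("producer blocked %v, consumer blocked %v\n", p, c)
}
```

In this setup the producer should report far more blocked time than the consumer, which marks the consumer as the slower stage. In a real pipeline each stage would keep one total for its input channel and one for its output channel and report them periodically.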

Long delays between processing of two consecutive kafka batches (using ruby/karafka consumer)

I am using Karafka to read from a topic and call an external service. Each call to the external service takes roughly 300 ms. With 3 consumers (3 pods in k8s) running in the consumer group, I expect to achieve 10 events per second. I see these log lines, which also confirm the 300 ms expectation for processing each individual event.
However, the overall throughput doesn't add up. Each Karafka process seems stuck for a long time between processing two batches of events.
The following instrumentation around the consume method implies that the consumer code itself is not taking the time.
INFO Inline processing of topic with 8 messages took 2571 ms
INFO 8 messages on topic delegated to xyz
However, I notice two things:
When I tail logs on the 3 pods, only one of the 3 pods seems to emit logs at a time. This does not make sense to me, as all partitions have enough events and each consumer should be able to consume in parallel.
Though the above message shows roughly 321 ms (2571/8) per event, in reality I see the logs stalled for a long duration between the processing of two batches. I am curious: where is that time going?
There is some skew in the distribution of data across brokers, as we recently expanded our brokers from 3 to a total of 6. However, none of the brokers is under CPU or disk pressure. This is a new cluster, and hardly 4-5% CPU is used at peak times.
Our data is evenly distributed across 3 partitions. I say this because the last offset is roughly the same across each partition.
However, I do see that one consumer perpetually lags behind the other two.
The following table shows the lag for my consumers. There is one consumer process for each partition:
First Offset | Last Offset | Consumer Offset | Combined lag
Here is a screenshot of the logs from all 3 consumers. You can notice the big difference between the time spent in each invocation of the consume function and the interval between two adjacent invocations. Basically, I want to explain and/or reduce that waiting time. There are 100k+ events in this topic and my dummy Karafka applications are able to retrieve them quickly, so the Kafka brokers are not the issue.
Update after setting max_wait_time to 1 second (previously 5 seconds)
It seems that the issue is resolved after reducing the wait config. Now the difference between two consecutive logs is roughly equal to the time spent in consume:
2021-06-24 13:43:23.425 Inline processing of topic x with 7 messages took 2047 ms
2021-06-24 13:43:27.787 Inline processing of topic x with 11 messages took 3347 ms
2021-06-24 13:43:31.144 Inline processing of topic x with 11 messages took 3344 ms
2021-06-24 13:43:34.207 Inline processing of topic x with 10 messages took 3049 ms
2021-06-24 13:43:37.606 Inline processing of topic x with 11 messages took 3388 ms
There are a couple of problems you may be facing. It is a bit of guesswork on my side without more details, but let's give it a shot.
From the Kafka perspective
Are you sure you're evenly distributing data across partitions? Maybe it is eating up things from one partition?
What you wrote here:
INFO Inline processing of topic with 8 messages took 2571 ms
This indicates that there was a batch of 8 processed altogether by a single consumer. This could indicate that the data is not distributed evenly.
From the performance perspective
There are two performance properties that can affect your understanding of how Karafka operates: throughput and latency.
Throughput is the number of messages that can be processed in a given time.
Latency is the time it takes a message to get from the moment it was produced to the moment it is processed.
As far as I understand, all messages are being produced. You could try playing with the Karafka settings, in particular this one:
From the logger perspective
The logger that is being used flushes data only from time to time, so you won't see a log line immediately but after a bit of a delay. You can validate this by looking at the timestamps in the log.

In NiFi, what is the difference between FirstInFirstOutPrioritizer and OldestFlowFileFirstPrioritizer?

The user guide has the details below on prioritizers. Could you please help me understand how these are different and provide a real-world example?
FirstInFirstOutPrioritizer: Given two FlowFiles, the one that reached the connection first will be processed first.
OldestFlowFileFirstPrioritizer: Given two FlowFiles, the one that is oldest in the dataflow will be processed first. 'This is the default scheme that is used if no prioritizers are selected.'
Imagine two processors A and B that are both connected to a funnel, and then the funnel connects to processor C.
Scenario 1 - The connection between the funnel and processor C has first-in-first-out prioritizer.
In this case, the flow files in the queue between the funnel and processor C will be processed strictly in the order in which they reached the queue.
Scenario 2 - The connection between the funnel and processor C has oldest-flow-file-first prioritizer.
In this case, there could already be flow files in the queue between the funnel and processor C, but if one of the processors transfers a flow file to that queue that is older than all the flow files already in it, that flow file will jump to the front.
You could imagine that some flow files come from a different portion of the flow that takes way longer to process than other flow files, but they both end up funneled into the same queue, so these flow files from the longer processing part are considered older.
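The two scenarios above can be mimicked with a small sort. This is only a toy model under stated assumptions, not NiFi code: the `flowFile` struct, its `enteredFlow` and `queuedAt` fields, and the sample timestamps are invented for illustration, and "oldest in the dataflow" is taken to mean the time the data first entered the flow.

```go
package main

import (
	"fmt"
	"sort"
)

// flowFile is a hypothetical stand-in for a NiFi flow file. Only the two
// timestamps that matter for this comparison are modelled.
type flowFile struct {
	name        string
	enteredFlow int // when the data first entered the dataflow
	queuedAt    int // when it reached this particular connection
}

// sampleQueue models the funnel's output queue: "from-B" entered the flow
// earlier but took a slower path, so it reached the queue later.
func sampleQueue() []flowFile {
	return []flowFile{
		{name: "from-A", enteredFlow: 10, queuedAt: 11},
		{name: "from-B", enteredFlow: 5, queuedAt: 20},
	}
}

// firstInFirstOut orders by arrival time at the connection.
func firstInFirstOut(a, b flowFile) bool { return a.queuedAt < b.queuedAt }

// oldestFlowFileFirst orders by age in the overall dataflow.
func oldestFlowFileFirst(a, b flowFile) bool { return a.enteredFlow < b.enteredFlow }

// order returns the names in the order processor C would see them.
func order(q []flowFile, less func(a, b flowFile) bool) []string {
	sorted := append([]flowFile(nil), q...) // copy so the input is untouched
	sort.SliceStable(sorted, func(i, j int) bool { return less(sorted[i], sorted[j]) })
	names := make([]string, len(sorted))
	for i, f := range sorted {
		names[i] = f.name
	}
	return names
}

func main() {
	q := sampleQueue()
	fmt.Println("first in, first out:   ", order(q, firstInFirstOut))     // [from-A from-B]
	fmt.Println("oldest flow file first:", order(q, oldestFlowFileFirst)) // [from-B from-A]
}
```

With first-in-first-out, `from-A` goes first because it reached the queue first; with oldest-flow-file-first, `from-B` jumps ahead because it has been in the dataflow longer.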
Apache NiFi handles data from many disparate sources and can route it through a number of different processors. Let's use the following example (ignore the processor types, just focus on the titles):
First, the relative rate of incoming data can be different depending on the source/ingestion point. In this case, the database poll is being done once per minute, while the HTTP poll is every 5 seconds, and the file tailing is every second. So even if a database record is 59 seconds "older" than another, if they are captured in the same execution of the processor, they will enter NiFi at the same time and the flowfile(s) (depending on splitting) will have the same origin time.
If some data coming into the system "is dirty", it gets routed to a processor which "cleans" it. This processor takes 3 seconds to execute.
If both the clean relationship and the success relationship from "Clean Data" went directly to "Process Data", you wouldn't be able to control the order that those flowfiles were processed. However, because there is a funnel that merges those queues, you can choose a prioritizer on the queued queue, and control that order. Do you want the first flowfile to enter that queue processed first, or do you want flowfiles that entered NiFi earlier to be processed first, even if they entered this specific queue after a newer flowfile?
This is a contrived example, but you can apply this to disaster recovery situations where some data was missed for a time window and is now being recovered, or a flow that processes time-sensitive data and the insights aren't valid after a certain period of time has passed. If using backpressure or getting data in large (slow) batches, you can see how in some cases, oldest first is less valuable and vice versa.

Recovery techniques for Spark Streaming scheduling delay

We have a Spark Streaming application that has basically zero scheduling delay for hours, but then the delay suddenly jumps up to multiple minutes and spirals out of control. This happens after a while even if we double the batch interval.
We are not sure what causes the delay to happen (theories include garbage collection). The cluster has generally low CPU utilization regardless of whether we use 3, 5 or 10 slaves.
We are really reluctant to further increase the batch interval, since the delay is zero for such long periods. Are there any techniques to improve recovery time from a sudden spike in scheduling delay? We've tried seeing if it will recover on its own, but it takes hours if it even recovers at all.
Open the batch links and identify which stages are delayed. Is there any external access to other DBs/applications that is contributing to this delay?
Go into each job and look at the data/records processed by each executor. You can find problems here.
There may be skew in the data partitions as well. If the application is reading data from Kafka and processing it, there can be skew in the data across cores if the partitioning is not well defined. Tune these parameters: number of Kafka partitions, number of RDD partitions, number of executors, number of executor cores.

Google App Engine Task Queue

I want to run 50 tasks. All these tasks execute the same piece of code; the only difference will be the data. Which will complete faster?
a. Queuing up 50 tasks in one queue
b. Queuing up 5 tasks each in 10 different queues
Is there an ideal number of tasks that can be queued up in one queue before using another queue?
The rate at which tasks are executed depends on two factors: the number of instances your app is running on, and the execution rate of the queue the tasks are on.
The maximum task queue execution rate is now 100 per queue per second, so that's not likely to be a limiting factor - so there's no harm in adding them to the same queue. In any case, sharding between queues for more execution rate is at best a hack. Queues are designed for functional separation, not as a performance measure.
The bursting rate of task queues is controlled by the bucket size. If there is a token in the queue's bucket, the task should run immediately. So if you have:
- name: big_queue
  rate: 50/s
  bucket_size: 50
and haven't queued any tasks in the last second, all tasks should start right away.
Splitting the tasks into different queues will not improve the response time unless the bucket hadn't had enough time to completely fill with tokens.
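To make the bucket behaviour concrete, here is a minimal token-bucket sketch. It is a generic model of the algorithm, not App Engine's actual scheduler: the `tokenBucket` type, `newTokenBucket`, `take` and `startedImmediately` are hypothetical names, time is passed in as plain seconds to keep the example deterministic, and the bucket is assumed to start full.

```go
package main

import "fmt"

// tokenBucket is a generic token bucket: tokens refill at `rate` per second
// up to `capacity` (the bucket_size), and each started task spends one token.
type tokenBucket struct {
	rate     float64 // tokens added per second
	capacity float64 // maximum number of stored tokens
	tokens   float64 // tokens currently available
	last     float64 // time of the last refill, in seconds
}

// newTokenBucket returns a bucket that starts full at time 0.
func newTokenBucket(rate, capacity float64) *tokenBucket {
	return &tokenBucket{rate: rate, capacity: capacity, tokens: capacity}
}

// take refills the bucket for the time elapsed since the last call and
// reports whether a task may start at time `now`.
func (b *tokenBucket) take(now float64) bool {
	b.tokens += (now - b.last) * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

// startedImmediately counts how many of n tasks enqueued at time `now`
// find a token and can start right away.
func startedImmediately(b *tokenBucket, n int, now float64) int {
	started := 0
	for i := 0; i < n; i++ {
		if b.take(now) {
			started++
		}
	}
	return started
}

func main() {
	big := newTokenBucket(50, 50)
	fmt.Println(startedImmediately(big, 50, 0))   // 50: a full bucket covers the whole burst
	fmt.Println(startedImmediately(big, 50, 0.5)) // 25: only half a second of refill since the burst

	small := newTokenBucket(50, 5)
	fmt.Println(startedImmediately(small, 50, 0)) // 5: the burst is capped by the bucket size
}
```

The second call shows the case where the bucket has not had time to refill completely: only the tokens accumulated since the last burst are available, and the remaining tasks wait for the refill rate.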
I'd add another factor into the mix: concurrency. If you have slow-running tasks (more than 30 seconds or so), then App Engine seems to struggle to scale up the correct number of instances to deal with the requests (it seems to max out at about 7-8 for me).
As of SDK 1.4.3, there are settings in your queue.xml and your appengine-web.xml that you can use to tell App Engine that each instance can handle more than one task at a time:
<threadsafe>true</threadsafe> (in appengine-web.xml)
<max-concurrent-requests>10</max-concurrent-requests> (in queue.xml)
This solved all my problems with tasks executing too slowly (despite setting all other queue parameters to the maximum).
Queue up 50 tasks and set your queue to process 10 at a time, or however many you would like, if they can run independently of each other. I have a similar problem, and I just run 10 tasks at a time to process the 3300 or so that I need to run. It takes 45 minutes or so to process all of them, but surprisingly the CPU time used is negligible.
