How can I constrain a NiFi processor's responses? (queue)

When I send a request to an API in NiFi, more than one response is returned, and the content of these responses is identical. If I don't stop the processor, the responses keep coming, so I end up turning the processor on and off quickly. Is there a way to restrict this?
Can I have the flow return the response only a certain number of times, no matter how many requests the processor sends? For example, return only 3 responses.

NiFi flows are intended to be always-on streams. If you go to the Scheduling tab of a processor's config, you'll see that, by default, it is scheduled to run continuously (0 ms).
If you don't want this style of streaming behaviour, you need to change the Scheduling of the processor.
You can change it to only schedule the processor every X seconds, or you can change it to run based on a CRON expression.
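For example (illustrative values, not the only options): a Timer driven Scheduling Strategy with a Run Schedule of 30 sec triggers the processor at most once every 30 seconds, while a CRON driven strategy with a Run Schedule of 0 0/5 * * * ? (NiFi uses Quartz-style cron expressions) triggers it once every 5 minutes.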

Related

Proper way to trigger a flow retry?

Consider this flow:
It's a simple flow to authenticate to an HTTP API and handle success/failure. On the failure path, you can see I added a ControlRate processor, and there are 2 FlowFiles in the queue feeding it. I have it set to pass only one FlowFile every 30 seconds (Time Duration = 30 sec, Maximum Rate = 1). So the queue will continue to fill while this goes on, if the authentication process continues to fail.
What I want is to essentially drop all but the first FlowFile in this queue, because I don't want it to continue re-triggering the authentication processor after we get a successful authentication.
I believe I can accomplish this by setting the FlowFile Expiration (on the highlighted queue) to be just longer than the 30 second Time Duration of the ControlRate processor. But this seems a bit arbitrary and not quite correct in my mind.
Is there a way to say "take first, drop rest" for the highlighted queue?

kg.apc.jmeter.functions.FifoTimeout does not work if added to user.properties in JMeter

I am using inter thread communicator plugin to share data between two thread groups.
TG-1: generates an ID -> stores it in the queue named Q1
TG-2: picks an ID from the queue -> does the processing
After the run duration of TG-1 is completed, it stops generating and storing IDs in Q1. TG-2 processes all the data in the queue and then keeps waiting for new data in Q1, even though Q1 will never receive any more. My expectation was that when the run duration of TG-2 completed, TG-2 would finish its job and exit. Why does TG-2 keep waiting for data in Q1? This exhausts the heap space and the test never stops, which is a serious issue.
To prevent this, I tried adding kg.apc.jmeter.functions.FifoTimeout=120 to the user.properties file, as suggested by Dmitri T in one of my previous questions on the same topic. However, the property is not taking effect. Has anybody else experienced the same thing with this plugin? What is the alternative?
We are not telepathic enough to guess what your setup is, which exact components of the Inter-Thread Communication Plugin you're using, or how they're configured.
If you're using Functions: the timeout works fine for the __fifoPop() function; just make sure to restart JMeter after amending the property. The __fifoGet() function will simply return an empty value if the queue is empty.
If you're using the jp@gc - Inter-Thread Communication PreProcessor, you can specify the timeout directly in the GUI.
Also, it is always possible to stop the test via the Flow Control Action Sampler.
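For example (illustrative usage, assuming a queue named Q1): with kg.apc.jmeter.functions.FifoTimeout=120 in user.properties and JMeter restarted, ${__fifoPop(Q1,id)} will wait at most 120 seconds for data instead of blocking forever, while ${__fifoGet(Q1,id)} returns immediately with an empty value when Q1 is empty.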

Design Pattern - Spring KafkaListener processing 1 million records in 1 hour

My Spring Boot application is going to listen to 1 million records an hour from a Kafka broker. The entire processing logic for each message takes 1-1.5 seconds, including a database insert. The topic has 64 partitions, which is also the concurrency of my @KafkaListener.
My current code is only able to process 90 records a minute in a lower environment, where I am listening to around 50k records an hour. Below is the code; all other config parameters, like max.poll.records, are left at their default values:
@KafkaListener(id = "xyz-listener", concurrency = "64", topics = "my-topic")
public void listener(String record) {
    // processing logic
}
I do get "it is likely that the consumer was kicked out of the group" errors 7-8 times an hour. I think both of these issues can be solved by isolating the listener method and multithreading the processing of each message, but I am not sure how to do that.
There are a few points to consider here. First, 64 consumers seems a bit too much for a single application to handle consistently.
Considering each poll by default fetches up to 500 records per consumer at a time, your app might be getting overloaded and causing the consumers to get kicked out of the group if a single batch takes more than the 5-minute default of max.poll.interval.ms to be processed.
So first, I'd consider scaling the application horizontally so that each instance handles a smaller number of partitions / threads.
A second way to increase throughput would be using a batch listener, and handling processing and DB insertions in batches as you can see in this answer.
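As a minimal sketch, a batch listener might look something like this (the topic name, the INSERT statement, the jdbcTemplate field, and the applyRules helper are assumptions for illustration; the batch = "true" attribute requires Spring Kafka 2.8+, on older versions you enable batch mode on the container factory instead):

@KafkaListener(id = "xyz-batch-listener", concurrency = "8", topics = "my-topic", batch = "true")
public void listen(List<String> records) {
    // Run every record of the poll through the processing logic in memory first...
    List<Object[]> rows = new ArrayList<>(records.size());
    for (String record : records) {
        rows.add(new Object[] { applyRules(record) }); // hypothetical in-memory processing step
    }
    // ...then write the whole poll to the DB in one batched statement instead of row by row.
    jdbcTemplate.batchUpdate("INSERT INTO results (value) VALUES (?)", rows);
}

Batching the DB writes is where most of the gain comes from: one round trip per poll replaces one round trip per record.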
Using both, you should be processing a sensible amount of work in parallel per app, and should be able to achieve your desired throughput.
Of course, you should load test each approach with different figures to have proper metrics.
EDIT: Addressing your comment: if you want to achieve this throughput, I wouldn't give up on batch processing just yet. If you do the DB operations row by row, you'll need a lot more resources for the same performance.
If your rule engine doesn't do any I/O you can iterate each record from the batch through it without losing performance.
About data consistency, you can try some strategies. For example, you can have a lock to ensure that even through a rebalance only one instance will process a given batch of records at a given time - or perhaps there's a more idiomatic way of handling that in Kafka using the rebalance hooks.
With that in place, you can batch load all the information you need to filter out duplicated / outdated records when you receive the records, iterate each record through the rule engine in memory, and then batch persist all results, to then release the lock.
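As a very rough sketch of that strategy (lockRegistry stands for any shared lock provider, e.g. Spring Integration's LockRegistry; loadExistingKeys, keyOf, applyRules, batchPersist and Result are hypothetical helpers, not an established Kafka API):

public void processBatch(List<String> records) {
    Lock lock = lockRegistry.obtain("my-topic-batch"); // one lock per logical batch key
    lock.lock();
    try {
        Set<String> existing = loadExistingKeys(records); // batch load what's needed for de-duplication
        List<Result> results = new ArrayList<>();
        for (String record : records) {
            if (existing.contains(keyOf(record))) {
                continue; // filter out duplicated / outdated records
            }
            results.add(applyRules(record)); // rule engine runs purely in memory
        }
        batchPersist(results); // single batched DB write
    } finally {
        lock.unlock(); // release the lock once results are persisted
    }
}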
Of course, it's hard to come up with an ideal strategy without knowing more details about the process. The point is by doing that you should be able to handle around 10x more records within each instance, so I'd definitely give it a shot.

Why does the NiFi PutParquet processor create so many tasks?

The NiFi PutParquet processor, with a timer-driven run schedule of 0 sec and the previous processor in a stopped state, shows ~3000 tasks for the last 5 minutes.
We are on Nifi 1.9.2.
My expectation would be that this processor only creates tasks if there is data in its incoming queue. Is this a misconfiguration or a bug in the implementation?
The processor is annotated with @TriggerWhenEmpty, which lets it execute all the time, regardless of data in the incoming queue. The reason for this is that in a Kerberized environment, the processor needs a chance to refresh its credentials. It was a common problem with other processors: no data comes in for a long time, say over a weekend, during which the Kerberos ticket expires, and then when data starts coming in on Monday everything fails.
These empty executions shouldn't have a big impact on the system. When the processor executes and no data is available, it just calls yield and returns. The default yield duration is 1 second, but it is configurable through the UI.
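For illustration, the pattern inside such a processor looks roughly like this (a simplified sketch, not the actual PutParquet source):

import org.apache.nifi.annotation.behavior.TriggerWhenEmpty;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.exception.ProcessException;

@TriggerWhenEmpty
public class MyKerberizedProcessor extends AbstractProcessor {
    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            // No work available: this empty execution is the chance to renew
            // credentials, then yield so the processor backs off.
            context.yield(); // waits the configured Yield Duration (default 1 sec)
            return;
        }
        // ... normal processing of the FlowFile ...
    }
}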

MiNiFi - How to get processors list and number of queued flowfiles?

I'd like to monitor the state of a running MiNiFi flow, especially to get the list of processors and the number of queued FlowFiles for each. I'm trying to use FlowStatus Script Queries, e.g.:
$ ./minifi.sh flowStatus systemdiagnostics:processorstats
{"controllerServiceStatusList":null,"processorStatusList":null,"connectionStatusList":null,"remoteProcessGroupStatusList":null,"instanceStatus":null,"systemDiagnosticsStatus":{"garbageCollectionStatusList":null,"heapStatus":null,"contentRepositoryUsageList":null,"flowfileRepositoryUsage":null,"processorStatus":{"loadAverage":1.99,"availableProcessors":2}},"reportingTaskStatusList":null,"errorsGeneratingReport":[]}
$ ./minifi.sh flowStatus processor:all:health,stats,bulletins
{"controllerServiceStatusList":null,"processorStatusList":[],"connectionStatusList":null,"remoteProcessGroupStatusList":null,"instanceStatus":null,"systemDiagnosticsStatus":null,"reportingTaskStatusList":null,"errorsGeneratingReport":[]}
$ ./minifi.sh flowStatus processor:MyProcessorName:health,stats,bulletins
{"controllerServiceStatusList":null,"processorStatusList":[],"connectionStatusList":null,"remoteProcessGroupStatusList":null,"instanceStatus":null,"systemDiagnosticsStatus":null,"reportingTaskStatusList":null,"errorsGeneratingReport":["Unable to get status for request 'processor:MyProcessorName:health,stats,bulletins' due to:org.apache.nifi.minifi.status.StatusRequestException: No processor with key MyProcessorName to report status on"]}
but I'm receiving only nulls. What should I do to be able to retrieve the data I want (do I need to enable some option in the config)? Is it possible using flowStatus queries? My running flow contains several processors, so why does systemdiagnostics show only two availableProcessors, and why can't I use the flowStatus processor command to get any processor data?
Unfortunately the NiFi/MiNiFi documentation is sparse here, so I'm not even sure whether I can retrieve processor data (the number of queued elements and the list of processors) this way. If not, do you maybe know some other way to do it?
Do you have any processors in a flow running on this instance of MiNiFi? Each response to the queries you've submitted shows no processors. In fact, the third example says so explicitly -- "Unable to get status for request 'processor:MyProcessorName:health,stats,bulletins' due to:org.apache.nifi.minifi.status.StatusRequestException: No processor with key MyProcessorName to report status on". Note also that availableProcessors under systemDiagnosticsStatus is the number of CPU cores available to the JVM, not the number of processors in your flow.
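Once the flow is running and does contain components, queued FlowFile counts are reported per connection rather than per processor; assuming I remember the FlowStatus query options correctly (double-check them against the MiNiFi System Administrator's Guide for your version), something like this should return them:
$ ./minifi.sh flowStatus connection:all:health,stats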
