How are threads of Processors invoked in Nifi flow? - apache-nifi

I'm trying to learn how to write a custom NiFi processor, and according to the documentation the processor should be thread-safe. What I want to understand is this: if, say, I have 100 flow files queued for my custom processor, would my processor's onTrigger method (assume that I haven't put @TriggerSerially on the processor) be triggered 100 times, in 100 separate threads (whether concurrently or not), or is there a possibility that one flow file is used as input to more than one thread running the onTrigger method of my processor?
I apologize if I didn't articulate the question correctly, but essentially: is it possible that my processor's onTrigger method is triggered more times than the number of flow files queued as input to the processor?

The number of threads executing a processor is based on the number of concurrent tasks on the scheduling tab, which defaults to 1. If you increase this to 2, then 2 threads are concurrently executing the onTrigger method. A single flow file will only be processed by one of these threads.
The @TriggerSerially annotation prevents you from increasing the concurrent tasks, so it guarantees that there is never concurrent execution. A common use case is a source processor that pulls data from somewhere: typically you wouldn't want to pull the same data twice concurrently.
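To make the "one flow file, one thread" point concrete, here is a minimal plain-Java simulation of the model described above. It is only a sketch: ConcurrentTasksDemo, run and the integer "flow files" are made-up names, not NiFi APIs, and it assumes each worker takes exactly one item per pass, which a real processor need not do.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Plain-Java simulation of the scheduling model; none of these names are NiFi APIs.
class ConcurrentTasksDemo {

    // Runs `concurrentTasks` workers against a queue of `flowFileCount` items and
    // returns how many times each item was handed to a worker.
    static Map<Integer, Integer> run(int flowFileCount, int concurrentTasks) {
        Queue<Integer> incoming = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < flowFileCount; i++) {
            incoming.add(i);
        }
        Map<Integer, Integer> timesProcessed = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(concurrentTasks);
        for (int t = 0; t < concurrentTasks; t++) {
            pool.submit(() -> {
                Integer flowFile;
                // poll() removes the head atomically, so two workers never get the same item.
                while ((flowFile = incoming.poll()) != null) {
                    timesProcessed.merge(flowFile, 1, Integer::sum);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return timesProcessed;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = run(100, 2);
        boolean eachOnce = counts.values().stream().allMatch(n -> n == 1);
        System.out.println(counts.size() + " flow files, each processed once: " + eachOnce);
    }
}
```

The number of threads is fixed by the "concurrent tasks" argument, not by the number of queued items, and every item ends up with a count of exactly 1.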

Related

How to execute multiple parallel controllers simultaneously?

I have 2 Parallel Controllers in my thread group, and in each of them I have added 1 Simple Controller containing 10 API requests. When I execute the script with 2 threads and a 2-second ramp-up time to check the start times, I observe that the script completes the 1st thread's Parallel Controller first and then the 2nd thread's Parallel Controller.
Scenario - To scan the barcodes from csv file.
Expected Result: Expected_Result_table_ss
Meanwhile, while Parallel Controller 2 is processing a barcode, the first controller should also pick the next barcode to scan. For example, when the 2nd Parallel Controller is scanning 110121 at 12:00:01:190, Parallel Controller 1 should pick the next barcode, 110123, in the same second or a few milliseconds later.
Actual Result: ActualResult_TABLE_ss
Jmeter Script Flow : Jmeter_Script_Flow_ss
I may be wrong, but I think this execution will be more precise if I am able to execute these Parallel Controllers simultaneously. Please let me know if any other logic can be applied to scan the barcodes simultaneously using the script.
The Parallel Controller executes its direct children in parallel; your "Jmeter Script Flow" results in sequential execution of all "API requests" by each thread (virtual user).
I don't think you understand the concept and use case of the Parallel Controller. It was implemented as a JMeter plugin to overcome the JMeter limitation of not being able to kick off extra threads within the bounds of one virtual user, which made simulating AJAX calls quite hard. If scanning the barcode really produces 10 requests at exactly the same moment, then your setup is good and you just need to move the "API requests" out of the "Simple Controller" so they would be direct children of the
JMeter threads are absolutely independent and know nothing about each other unless you use the Inter-Thread Communication Plugin, so each virtual user will execute Samplers from top to bottom as fast as it can.
If you want to execute both scenarios at the same time with 2 users, take a look at the Synchronizing Timer.

Design Pattern - Spring KafkaListener processing 1 million records in 1 hour

My Spring Boot application is going to consume 1 million records an hour from a Kafka broker. The entire processing logic for each message takes 1-1.5 seconds, including a database insert. The topic has 64 partitions, which is also the concurrency of my @KafkaListener.
My current code is only able to process 90 records in a minute in a lower environment where I am listening to around 50k records an hour. Below is the code and all other config parameters like max.poll.records etc are default values:
@KafkaListener(id = "xyz-listener", concurrency = "64", topics = "my-topic")
public void listener(String record) {
    // processing logic
}
I do get "it is likely that the consumer was kicked out of the group" 7-8 times an hour. I think both of these issues can be solved through isolating listener method and multithreading processing of each message but I am not sure how to do that.
There are a few points to consider here. First, 64 consumers seems a bit too much for a single application to handle consistently.
Considering that each poll fetches up to 500 records per consumer by default (max.poll.records), your app might be getting overloaded, and consumers get kicked out of the group if a single batch takes longer to process than the 5-minute default of max.poll.interval.ms.
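To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the Kafka consumer defaults (max.poll.records = 500, max.poll.interval.ms = 300000) and the 1-1.5 seconds per record from the question; PollBudget and its method names are made up for illustration.

```java
// Back-of-the-envelope numbers from the question and the Kafka consumer defaults.
class PollBudget {

    // Seconds needed to work through one poll's worth of records sequentially.
    static double secondsPerBatch(int maxPollRecords, double secondsPerRecord) {
        return maxPollRecords * secondsPerRecord;
    }

    // Records per hour for `consumers` threads, each handling one record at a time.
    static double recordsPerHour(int consumers, double secondsPerRecord) {
        return consumers * 3600.0 / secondsPerRecord;
    }

    public static void main(String[] args) {
        int maxPollRecords = 500;          // Kafka default for max.poll.records
        double maxPollIntervalSec = 300.0; // Kafka default for max.poll.interval.ms (300000 ms)

        for (double perRecord : new double[] {1.0, 1.5}) {
            double batch = secondsPerBatch(maxPollRecords, perRecord);
            System.out.printf(
                    "%.1f s/record: full batch takes %.0f s (limit %.0f s), 64 consumers do %.0f records/hour%n",
                    perRecord, batch, maxPollIntervalSec, recordsPerHour(64, perRecord));
        }
    }
}
```

Under these assumptions a full batch of 500 records needs 500 to 750 seconds, well past the 300-second limit, and even 64 perfectly busy consumers processing one record at a time stay below 1 million records an hour.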
So first, I'd consider scaling the application horizontally so that each application handles a smaller amount of partitions / threads.
A second way to increase throughput would be using a batch listener, and handling processing and DB insertions in batches as you can see in this answer.
Using both, you should be processing a sensible amount of work in parallel per app, and should be able to achieve your desired throughput.
Of course, you should load test each approach with different figures to have proper metrics.
EDIT: Addressing your comment, if you want to achieve this throughput I wouldn't give up on batch processing just yet. If you do the DB operations row by row you'll need a lot more resources for the same performance.
If your rule engine doesn't do any I/O you can iterate each record from the batch through it without losing performance.
About data consistency, you can try some strategies. For example, you can have a lock to ensure that even through a rebalance only one instance will process a given batch of records at a given time - or perhaps there's a more idiomatic way of handling that in Kafka using the rebalance hooks.
With that in place, you can batch load all the information you need to filter out duplicated / outdated records when you receive the records, iterate each record through the rule engine in memory, and then batch persist all results, to then release the lock.
Of course, it's hard to come up with an ideal strategy without knowing more details about the process. The point is by doing that you should be able to handle around 10x more records within each instance, so I'd definitely give it a shot.

Why does Nifi PutParquet processor create so many tasks?

The NiFi PutParquet processor, with a timer-driven run schedule of 0 sec and the previous processor stopped, shows ~3000 tasks for the last 5 minutes.
We are on Nifi 1.9.2.
My expectation would be that this processor only creates tasks if data is in the incoming queue for the processor. Is this some misconfiguration or a bug in the implementation?
The processor is annotated with @TriggerWhenEmpty, which lets it execute all the time regardless of whether there is data in the incoming queue. The reason is that in a kerberized environment the processor needs a chance to refresh its credentials. It was a common problem with other processors that no data came in for a long time, say over a weekend, the Kerberos ticket expired during that time, and then everything failed when data started coming in on Monday.
These empty executions shouldn't have a big impact on the system. When the processor executes and no data is available, it just calls yield and returns. The default yield duration is 1 second, but is controllable through the UI.

Is Spring Batch's default behavior to process the next item only after the first item has finished?

After reading this article about the possibilities of scaling and parallel processing in Spring-Batch we were wondering, what is the out-of-the-box behavior of Spring-batch?
Let's say our job has reader, 5 steps and a writer.
Will Spring-batch read one item, pass it through all the 5 steps, write it and only then move on to the next item? Something like a giant for loop?
Or is there some parallelism, so that while item A moves on to step 2, item B is read and handed to step 1?
I think you are misunderstanding how Spring Batch works. Let me start with that, then go into parallelism.
A chunk based step in Spring Batch consists of an ItemReader, an optional ItemProcessor, then an ItemWriter. Each of these obviously supports composition (Spring Batch provides some components for using composition in both the ItemProcessor and ItemWriter phases). Within that step, Spring Batch reads items until a given condition is met (typically chunk size). Then that list is iterated over, passing each item to the ItemProcessor. Finally, a list of all of the results from the ItemProcessor calls is passed in a single call to the ItemWriter. The concept of reading once, then doing multiple steps, then writing really isn't how Spring Batch works. The closest we get would be a single ItemReader, then using composition to create a chain of ItemProcessor calls, then a single call to an ItemWriter.
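The read, process, write cycle described above can be imitated in a few lines of plain Java. This is only a sketch of the control flow: ChunkLoopSketch and runStep are made-up names, the Iterator, Function and Consumer stand in for ItemReader, ItemProcessor and ItemWriter, and transactions, skips and retries are left out entirely.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Plain-Java imitation of a chunk-oriented step; these are not Spring Batch types.
class ChunkLoopSketch {

    // Reads up to chunkSize items, processes each, then writes the whole chunk in one call.
    // Returns the number of write calls, i.e. the number of chunks.
    static <I, O> int runStep(Iterator<I> reader, Function<I, O> processor,
                              Consumer<List<O>> writer, int chunkSize) {
        int writeCalls = 0;
        while (reader.hasNext()) {
            List<I> items = new ArrayList<>();
            while (items.size() < chunkSize && reader.hasNext()) {
                items.add(reader.next());           // read phase
            }
            List<O> results = new ArrayList<>();
            for (I item : items) {
                results.add(processor.apply(item)); // process phase, one item at a time
            }
            writer.accept(results);                 // write phase, one call per chunk
            writeCalls++;
        }
        return writeCalls;
    }

    public static void main(String[] args) {
        List<List<String>> written = new ArrayList<>();
        int calls = runStep(List.of(1, 2, 3, 4, 5).iterator(), n -> "item-" + n, written::add, 2);
        System.out.println(calls + " write calls: " + written);
    }
}
```

With five items and a chunk size of 2, the writer is called three times, once per chunk, and nothing runs in parallel unless one of the scaling options is configured.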
With that being said, Spring Batch provides a number of parallelism options. There are five different options for scaling Spring Batch jobs. I won't go into details about each because that's beyond the scope of this and clearly discussed in other StackOverflow questions as well as the documentation. However, the list is as follows:
Multithreaded steps - Here each chunk (block of items processed within a transaction) is executed within a different thread using Spring's TaskExecutor abstraction.
Parallel steps - Here a batch job executes multiple, independent steps in parallel, again using Spring's TaskExecutor abstraction to control the threads used.
AsyncItemProcessor/AsyncItemWriter - Here each call to the ItemProcessor is executed in its own thread. The resulting Future is passed to the AsyncItemWriter, which unwraps the Future, and the results are persisted.
Partitioning - Spring Batch allows you to partition a data set into multiple partitions that are then executed in parallel either via local threading mechanisms or remotely.
Remote chunking - The last option is to have a master reading the data, then sending it to a pool of workers for processing and writing.

Throttle calls to ItemProcessor in Spring batch

I've created a Spring Batch job which reads from a flat file and processes the data using an ItemProcessor before writing it to the DB using an ItemWriter. Everything works fine so far.
The problem is that I now need to control how often the process method is called. My ItemProcessor calls an API, and the API takes some time to respond (I'm not sure about the timeout), so I should not overload it with new messages. I need to throttle the calls, e.g. after X calls in Y seconds, I need to wait Z seconds before resuming.
I am not sure how to achieve this in Spring Batch. I am looking at implementing a ChunkListener in the processor to track the calls, but I am looking for a better approach.
You do not need a listener to do this.
If you do not define an asynchronous TaskExecutor in your step, then the whole processing is completely sequential.
It will read an item, process it, read the next item, process it, and so on until it has read and processed as many items as you defined in your commit interval (the size of your chunks). After that, it will put those items into a list and forward this list to the writer. This is repeated until all elements have been read, processed, and finally written.
If you would like to process your chunks in parallel, then you can define an asynchronous TaskExecutor.
If you define an AsyncTaskExecutor in your step, you are able to configure the number of threads this TaskExecutor manages/creates. Moreover, you can also define the throttle limit of your step, which defines the number of chunks that can be processed in parallel.
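The answer does not show how to enforce "X calls in Y seconds, then wait Z seconds". Since a step without a TaskExecutor is sequential, one way is a small helper that the processor consults before each API call. The sketch below is a hypothetical helper, not part of Spring Batch: WindowThrottle, acquire, and the injected clock and sleeper are made-up names, and the fixed-window counting is just one simple way to interpret the requirement.

```java
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

// Hypothetical helper, not part of Spring Batch: allows at most maxCalls per window,
// then pauses for pauseMillis before letting the next call through.
class WindowThrottle {
    private final int maxCalls;         // X
    private final long windowMillis;    // Y
    private final long pauseMillis;     // Z
    private final LongSupplier clock;   // returns the current time in ms
    private final LongConsumer sleeper; // blocks for the given number of ms
    private long windowStart;
    private int callsInWindow;

    WindowThrottle(int maxCalls, long windowMillis, long pauseMillis,
                   LongSupplier clock, LongConsumer sleeper) {
        this.maxCalls = maxCalls;
        this.windowMillis = windowMillis;
        this.pauseMillis = pauseMillis;
        this.clock = clock;
        this.sleeper = sleeper;
        this.windowStart = clock.getAsLong();
    }

    // Call once before each API request; blocks when the budget is used up.
    synchronized void acquire() {
        long now = clock.getAsLong();
        if (now - windowStart >= windowMillis) {
            windowStart = now;               // window expired, start a fresh one
            callsInWindow = 0;
        }
        if (callsInWindow >= maxCalls) {
            sleeper.accept(pauseMillis);     // budget exhausted: wait Z before resuming
            windowStart = clock.getAsLong();
            callsInWindow = 0;
        }
        callsInWindow++;
    }

    public static void main(String[] args) {
        // A fake clock keeps the demo instant; real code would pass
        // System::currentTimeMillis and a sleeper that wraps Thread.sleep.
        long[] now = {0};
        WindowThrottle throttle = new WindowThrottle(3, 1000, 5000,
                () -> now[0],
                ms -> { System.out.println("pausing " + ms + " ms"); now[0] += ms; });
        for (int call = 1; call <= 7; call++) {
            throttle.acquire();
            System.out.println("call " + call + " at t=" + now[0]);
        }
    }
}
```

In a processor, acquire() would go at the top of the process method, before the API call. The method is synchronized so the budget stays shared if the step is later made multi-threaded.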
