Does Apache NiFi support batch processing?

I need to know whether Apache NiFi supports running processors until completion, i.e. the execution of a series of processors in one process group waits for another process group's execution to complete.
For example:
Suppose there are three processors on the NiFi canvas:
P1-->P2-->P3
(P = Processor)
I need P1 to run to completion, then P2, and so on, so that they run in sequence with each processor waiting for the previous one to finish.
EDIT-1:
Just as an example: I have data at a web URL. I can download that data using the GetHTTP processor and store it on disk with PutFile. Once the file is saved in the PutFile directory, FetchFile should run to load that file into my database, as in the workflow below.
GetHTTP-->PutFile-->FetchFile-->DB
Is this possible?

NiFi itself is not really a batch processing system; it is a data flow system geared more towards continuous processing. Having said that, there are some techniques you can use to do batch-like operations, depending on which processors you're using.
The Split processors (SplitText, SplitJson, etc.) write attributes to the flow files, including "fragment.identifier", which is shared by all splits created from a single incoming flow file, and "fragment.count", which is the total number of those splits. Processors like MergeContent use these attributes to process a whole batch (aka fragment), so the output from those kinds of processors occurs only after an entire batch/fragment has been processed.
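As an illustration, here is a minimal sketch in plain Python (not a NiFi API) of the completion check those two attributes make possible: collect splits by their fragment.identifier and treat the batch as complete once fragment.count of them have arrived.

from collections import defaultdict

# Illustrative sketch only: how fragment.identifier / fragment.count
# allow a downstream consumer to detect that a whole batch has arrived.
seen = defaultdict(list)  # fragment.identifier -> splits collected so far

def on_flowfile(attributes):
    # 'attributes' is assumed to be a dict of the flow file's attributes
    batch_id = attributes["fragment.identifier"]
    seen[batch_id].append(attributes)
    # The batch (fragment) is complete once every split has arrived
    if len(seen[batch_id]) == int(attributes["fragment.count"]):
        emit_merged_batch(seen.pop(batch_id))

def emit_merged_batch(splits):
    print(f"batch of {len(splits)} splits complete")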
Another technique is to write an empty file to a temp directory when the job is complete; a ListFile processor pointing at that directory will then issue a flow file when the file is detected.
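A minimal sketch of that marker-file technique, assuming a hypothetical trigger directory that ListFile would be configured to watch:

from pathlib import Path

# Illustrative sketch: the directory and naming are assumptions; point
# ListFile at TRIGGER_DIR so it emits a flow file when the marker appears.
TRIGGER_DIR = Path("/tmp/nifi-triggers")

def signal_job_complete(job_id):
    TRIGGER_DIR.mkdir(parents=True, exist_ok=True)
    (TRIGGER_DIR / f"{job_id}.done").touch()  # empty marker file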
Can you describe more about the processors in your flow, and how you would know when a batch was complete?

Related

Run multiple batch jobs in parallel Mule 4

I have a batch job that reads hundreds of images from an SFTP location, encodes them into base64, and uploads them via API using the HTTP connector.
I would like to make the process run quicker, so I am trying to split the payload in two via scatter-gather and then send payload1 to one batch job in a subflow and payload2 to another batch job in another subflow.
Is this the right approach?
Or is it possible to split the load within a single batch process, i.e. have one half of the payload processed by batch step 1 and the second half processed by batch step 2 at the same time?
Thank you
No, it is not a good approach. Batch jobs are always executed asynchronously (i.e. using different threads), so there is no benefit in using scatter-gather, and it has the downside of increasing resource usage.
Splitting the payload in different batch steps doesn't make sense either. You should not try to scale by adding steps.
Batch jobs naturally work in parallel by iterating over an input. The job may be able to handle the splitting itself, or you can split the input payload manually beforehand; then let it handle the concurrency automatically. There are some configurations you can use to tune this, such as the block size.

How to make flowfiles received from ListFile processor wait until processing of one particular flowfile among them (if present) is completed?

Suppose I have a directory which contains multiple files. I want to list the directory, fetch all the files, and process them. But if there is a flow file with a particular filename (e.g., file.txt), I want to process that flow file first, before processing any other. Please note I can't list the directory again due to my use-case limitations; it has to be a single flow.
You can start with something similar to the flow below. Use Wait/Notify to implement a gate-like mechanism. For this to work as expected, though, you need to set a Run Schedule on ListFile whose execution interval is greater than the expiration duration of the Wait processor; that way, if the specific file is not present in a given listing, the other files will still be processed before the next ListFile execution rather than getting stuck in the Wait processor's queue.
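For clarity, here is the ordering that gate enforces, sketched in plain Python (outside NiFi); PRIORITY_NAME and process_file are placeholders for this example only.

# Illustrative sketch of the Wait/Notify gate's effect: the priority file,
# if present in a listing, is processed first; everything else is held
# back and released afterwards (or on expiration, in the real flow).
PRIORITY_NAME = "file.txt"

def process_file(name):
    print(f"processing {name}")

def process_listing(filenames):
    pending = list(filenames)
    if PRIORITY_NAME in pending:
        process_file(PRIORITY_NAME)  # "Notify": opens the gate
        pending.remove(PRIORITY_NAME)
    for name in pending:             # "Wait": these were held back
        process_file(name)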

NiFi: how to hold flow files until a downstream process is finished

I am designing a data ingestion pattern using NiFi. One process needs to stop releasing flow files until a downstream process has finished processing. I tried to use Wait and Notify but have not had any success. I am hoping the queue size and back pressure can be set across a few processors.
Similarly, is there a way to implement this logic: don't allow flow files in if there is one currently being processed across multiple processors?
Any help is appreciated.
You need a combination of MonitorActivity and ExecuteStreamCommand (running a Python "nipyapi" script).
I have a similar requirement in one of my working flows.
You will need to install the Python library nipyapi first and create this script on the NiFi box:
from time import sleep
import nipyapi

## Point nipyapi at your NiFi REST API (replace ipaddress:port)
nipyapi.utils.set_endpoint('http://ipaddress:port/nifi-api', ssl=False, login=False)

## Get the process group by its name on the canvas ('start' here)
mypg = nipyapi.canvas.get_process_group('start')

nipyapi.canvas.schedule_process_group(mypg.id, scheduled=True)   ## Start the group
sleep(1)
nipyapi.canvas.schedule_process_group(mypg.id, scheduled=False)  ## Stop it again
I will put the template in the image at the link below; see the configuration on the MonitorActivity processor. It will generate a flow file if no activity happens for 10 seconds (you can play with the timings, though).
Download template
Note: this is not a very good approach if you have strict latency requirements.
Another idea would be to monitor the aggregate queue of the entire flow and, if the queue is zero, restart the start flow (this would be very resource-intensive if you have a lot of connections).
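If you want to experiment with that idea, here is a hedged nipyapi sketch building on the script above; the attribute path status.aggregate_snapshot.flow_files_queued follows nipyapi's generated REST models but should be verified against your nipyapi/NiFi versions, and 'start' is the same placeholder group name as before.

from time import sleep
import nipyapi

nipyapi.utils.set_endpoint('http://ipaddress:port/nifi-api', ssl=False, login=False)

def total_queued():
    ## Aggregate flow file count across the whole flow (the root group);
    ## verify this model path against your nipyapi version
    root = nipyapi.canvas.get_process_group(nipyapi.canvas.get_root_pg_id(), 'id')
    return root.status.aggregate_snapshot.flow_files_queued

while True:
    if total_queued() == 0:
        ## Nothing queued anywhere: pulse the 'start' group again
        start = nipyapi.canvas.get_process_group('start')
        nipyapi.canvas.schedule_process_group(start.id, scheduled=True)
        sleep(1)
        nipyapi.canvas.schedule_process_group(start.id, scheduled=False)
    sleep(10)  ## frequent polling is what makes this approach heavy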
I was able to design a solution within NiFi, essentially using GenerateFlowFile as a signal (run once only). The trick is to have the newly generated flow file merge with the original input flow through defragmentation; every time the flow finishes, the success signal is then able to merge with the next input flow file.
Solution Flow

Synchronize NiFi process groups or flows that don't/can't connect?

Like the question states, is there some way to synchronize NiFi process groups or pipelines that don't/can't connect in the UI?
E.g., I have a process where I want getFTP->putHDFS->moveHDFS (which actually ends up being getFTP->putHDFS->listHDFS->moveHDFS; see https://stackoverflow.com/a/50166151/8236733). However, ListHDFS does not seem to accept any incoming connections. Trying to work around this with process groups, e.g. P1{getFTP->putHDFS->outport}->P2{inport->listHDFS->moveHDFS}, runs into the same problem. We don't want to moveHDFS before we have gotten anything from getFTP, but given the above, I don't see how these actions can be synchronized to occur in the right order.
I'm new to NiFi, but I imagine this is a common use case and there must be some NiFi-ish way of doing this that I am missing. Advice on this would be appreciated. Thanks.
I'm not sure what requirement is preventing you from writing the file retrieved from FTP directly to the desired HDFS location, or if this is a "write n files to HDFS with a . starting the filename and then rename all when some certain threshold is reached" scenario.
ListHDFS does not take any incoming relationships because it should not be triggered by an incoming event, but rather on a timer/CRON schedule. Every time it runs, it will produce n flowfiles, where each references an HDFS file that has been detected to be written to the filesystem since the last execution. To do this, the processor stores local state.
Your flow segments do not need to be connected in this case. You'll have "flow segment A" which performs the FTP -> HDFS writing (GetFTP -> PutHDFS) and you'll have an independent "flow segment B" which lists the HDFS directory, reads the file descriptors (but not the content of the file unless you use FetchHDFS as well) and moves them (ListHDFS -> MoveHDFS). The ListHDFS processor will run constantly, but if it does not detect any new files during a run, it will simply yield and perform a no-op. Once the PutHDFS processor completes the task of writing a file to the HDFS file system, on the next ListHDFS execution, it will detect that file and generate a flowfile describing it.
You can tune the scheduling to your liking, but in general this is a very common pattern in NiFi flows.
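To make that statefulness concrete, here is a conceptual sketch in plain Python of the list-new-files-since-last-run pattern, run against a local directory for illustration; it is not the actual ListHDFS implementation.

import os

# The processor persists the equivalent of this as local state
last_seen_mtime = 0.0

def list_new_files(directory):
    global last_seen_mtime
    new_paths = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.getmtime(path) > last_seen_mtime:
            new_paths.append(path)  # one flowfile per newly detected file
    if new_paths:
        last_seen_mtime = max(os.path.getmtime(p) for p in new_paths)
    return new_paths  # an empty result is a no-op, like ListHDFS yielding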

Spring Batch : multi file processing using multi threads

I have a requirement where I have to deal with multiple files (say 300 CSV files).
I need to read --> process --> write each individual file, as I need to apply some transformation logic to the data.
For each input file there would be a corresponding transformed file, so for 300 input files we would have 300 output files.
At the end, all the 300 output files are needed to be merged into a single file which would be compressed and then transferred to a remote location over FTP/SFTP.
Say every hour we would have to deal with a new set of 300 files on which we would be required to apply the above processing, so we would be scheduling the above job hourly.
How do I handle multi-file processing in the above scenario using Spring Batch?
How do I make the above processing happen in multiple threads?
Please suggest.
Thanks in advance.
You can use Spring task execution and scheduling and then use Java's ThreadPoolExecutor.
Check this answer here at SO for a very simple example.
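The linked answer is Java, but the fan-out pattern itself is simple; here is the same read --> transform --> write --> merge shape sketched with Python's concurrent.futures for illustration. The transform() logic and the paths are placeholders, not part of the original answer.

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def transform(line):
    return line.upper()  # placeholder transformation logic

def process_file(src):
    # read --> process --> write one file; one task per input file
    dst = src.with_suffix(".out")
    dst.write_text("".join(transform(line) for line in src.open()))
    return dst

def run(input_dir, merged_path="merged.csv", workers=8):
    files = sorted(Path(input_dir).glob("*.csv"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = list(pool.map(process_file, files))
    # Merge all per-file outputs; compressing and SFTP upload would follow
    with open(merged_path, "w") as out:
        for output in outputs:
            out.write(output.read_text())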
