NiFi how to release flow file until a process downstream is finished - apache-nifi

I am designing a data ingestion pattern using NiFi. One process needs to stop releasing flow files until a process downstream has finished processed. I tried to use wait and notified and have not made any success. I am hoping if the queue size and back pressure can be set across a few processors.
Similarly if there's a way I can implement logic: Don't allow flow files go in if there is one currently processing between multiple processors.
Any help is appreciated

You need a combination of MonitorActivity with executestreamcommand (with a python "nipyapi" script).
I have a similar requirement in one of my working flows.
You will need to install python lib nipyapi first and create this script on the nifi box.
from time import sleep
import nipyapi
nipyapi.utils.set_endpoint('http://ipaddress:port/nifi-api', ssl=False, login=False)
## Get PG ID using the PG Name
mypg = nipyapi.canvas.get_process_group('start')
nipyapi.canvas.schedule_process_group(mypg.id, scheduled=True) ## Start
sleep(1)
nipyapi.canvas.schedule_process_group(mypg.id, scheduled=False) ## Stop
I will put the template in the img in the link bellow, see the configuration on the monitor-activity processor - it will generate a flow if not activity is happening for 10 sec(you can play with the times thou).
Download template
Note: this is not a very good approach if you have high latency requirements.
Another idea would be to monitor the aggregate queue in the entire flow and if queue is zero then you restart start flow. (this would be very intense if you have a lot of connections)

I was able to design a solution within NiFi. Essentially using generate flow file as a signal (Only run once ever). The trick is have the newly generated flow file to merge with the original input flow through defragmentation. And every time after the flow has finished, the success condition will be able to merge with the next input flow file.
Solution Flow

Related

Spring batch file resume after Server failure

I have configured Spring Boot Batch to process Fixed length flat file. I read and split columns by using FlatFileItemReader, FixedLengthTokenizer and Writing data into Database by using ItemWriter, JPA Repository.
I have a scenario like, My Server was crashed or it was stopped at the time of file processing. At this point half of the file was processed(means half of the data wrote into DB). When it comes to next Job(when server was running up) the file has to start from where it stops.
For Example, A file having 1000 lines, Server was shutdown after processing 500 rows. In the next Job, The file has to start from 501 row.
I googled for solution but nothing relevant. Any help appreciated.
As far as I know, what you are asking ( restart at chunk level ) doesn't automatically exist in Spring Batch API & is something that programmer has to implement on his/her own.
Spring Batch provides Job restart feature via JobOperator.restart . This is a job level restart and a new execution id will be created for next run & whole of the job will rerun as there are other concerns like somebody put in a new file or renamed existing file to process in place of old file , how batch will know that its same input file content wise or db not changed since last run?
Due to these concerns , its imperative that programmer handles these situations via custom code.
Second concern is that when there is a server failure, job status would still be STARTED & not FAILED since it happens all of a sudden and framework couldn't update status correctly.
Following steps you need to implement ,
1.Implement a custom logic to decide if last job execution was successful or restart is needed.
2.If restart is needed, mark previous job execution as FAILED & then use JobOperator.restart(long executionId) - For a non - partitioned job , only useful impact would be the marking of job status to be correct as FAILED but whole job will restart from beginning.
There are many scenarios like,
a)job status is STARTED but all steps are marked COMPELTEDetc
b)For a partitioned job, few steps are completed , few failed & few in started etc
3.If restart is not needed, launch a new job using - JobLauncher.run.
So with above steps, you see that a real chunk level job restart is not achieved but above steps are primary things that you first need to understand & implement.
Next would be to changing your input at job restart i.e. you devise a mechanism to mark input records as processed for processed chunks ( i.e. read , processed & written ) & have a way to know what input records are not processed - then in next job run you feed modified input that is still unprocessed. So its all going to be your use case specific custom logic.
I am not aware of any inbuilt mechanism in the framework itself to achieve this. To me a Job Restart is a brand new job execution with modified / reduced input.

Synchronize NiFi process groups or flows that don't/can't connect?

Like the question states, is there some way to synchronize NiFi process groups or pipelines that don't/can't connect in the UI?
Eg. I have a process where I want to getFTP->putHDFS->moveHDFS (which ends up actually being getFTP->putHDFS->listHDFS->moveHDFS, see https://stackoverflow.com/a/50166151/8236733). However, listHDFS does not seem to take any incoming connections. Trying to do something with process groups like P1{getFTP->putHDFS->outport}->P2{inport->listHDFS->moveHDFS} also runs into the same problem (listHDFS can't seem to take any incoming connections). We don't want to moveHDFS before we ever even get anything from getFTP, but given the above, I don't see how these actions can be synchronized to occur in the right order.
New to NiFi, but I imagine this is a common use case and there must be some NiFi-ish way of doing this that I am missing. Advice in this would be appreciated. Thanks.
I'm not sure what requirement is preventing you from writing the file retrieved from FTP directly to the desired HDFS location, or if this is a "write n files to HDFS with a . starting the filename and then rename all when some certain threshold is reached" scenario.
ListHDFS does not take any incoming relationships because it should not be triggered by an incoming event, but rather on a timer/CRON schedule. Every time it runs, it will produce n flowfiles, where each references an HDFS file that has been detected to be written to the filesystem since the last execution. To do this, the processor stores local state.
Your flow segments do not need to be connected in this case. You'll have "flow segment A" which performs the FTP -> HDFS writing (GetFTP -> PutHDFS) and you'll have an independent "flow segment B" which lists the HDFS directory, reads the file descriptors (but not the content of the file unless you use FetchHDFS as well) and moves them (ListHDFS -> MoveHDFS). The ListHDFS processor will run constantly, but if it does not detect any new files during a run, it will simply yield and perform a no-op. Once the PutHDFS processor completes the task of writing a file to the HDFS file system, on the next ListHDFS execution, it will detect that file and generate a flowfile describing it.
You can tune the scheduling to your liking, but in general this is a very common pattern in NiFi flows.

Parallel Processing with Starting New Task - front end screen timeout

I am running an ABAP program to work with a huge amount of data. The SAP documentation gives the information that I should use
Remote Function Modules with the addition STARTING NEW TASK to process the data.
So my program first selects all the data, breaks the data into packages and calls a function module with a package of data for further processing.
So that's my pseudo code:
Select KEYFIELD from MYSAP_TABLE into table KEY_TABLE package size 500.
append KEY_TABLE to ALL_KEYS_TABLE.
Endselect.
Loop at ALL_KEYS_TABLE assigning <fs_table> .
call function 'Z_MASS_PROCESSING'
starting new TASK 'TEST' destination in group default
exporting
IT_DATA = <fs_table> .
Endloop .
But I am surprised to see that I am using Dialog Processes instead of Background Process for the call of my function module.
So now I encountered the problem that one of my Dialog Processes were killed after 60 Minutes because of Timeout.
For me, it seems that STARTING NEW TASK is not the right solution for parallel processing of mass data.
What will be the alternative?
As already mentioned, thats not an easy topic that is handled with a few lines of codes. The general steps you have to conduct in a thoughtful way to gain the desired benefit is:
1) Get free work processes available for parallel processing
2) Slice your data in packages to be processed
3) Call an RFC enabled function module asynchronously for each package with the available work processes. Handle waiting for free work processes, if packages > available processes
4) Receive your results asynchronously
5) Wait till everything is processed and merge the data together again and assure that every package was handled properly
Although it is bad practice to just post links, the code is very long and would make this answer very messy, therfore take a look at the following links:
Example1-aRFC
Example2-aRFC
Example3-aRFC
Other RFC variants (e.g. qRFC, tRFC etc.) can be found here with short description but sadly cannot give you further insight on them.
EDIT:
Regarding process type of aRFC:
In parallel processing, a job step is started as usual in a background
processing work process. (...)While the job itself runs in a
background process, the parallel processing tasks that it starts run
in dialog work processes. Such dialog work processes may be located on
any SAP server.
The server is specified with the GROUP (default: parallel_generators) see transaction RZ12 and can have its own ressources just for parallel processing. If your process times out, you have to slice your packages differently in size.
I think, best way for parallel processing in SAP is Bank Parallel Processing framework as Jagger mentioned. Unfortunently its rarerly mentioned in any resource and its not documented well.
Actually, best documentation I found was in this book
https://www.sap-press.com/abap-performance-tuning_2092/
Yes, it's tricky. It costed me about 5 or 6 days to force it going. But results were good.
All stuff is situated in package BANK_PP_JOBCTRL and you can use its name for googling.
Main idea there is to divide all your work into steps (simplified):
Preparation
Parallel processing
2.1. Processing preparation
2.2. Processing
(Actually there are more steps there)
First step is not paralleized. Here you should prepare all you data for parallel processing and devide it into 'piece' which will be processed in parallel.
Content of pieces, in turn, can be ID or preloaded data as well.
After that, you can run step 2 in parallel processing.
Great benefit of all this is that error in one piece of parallel work won't lead to crash of all your processing.
I recomend you check demo in function group BANK_API_PP_DEMO
To implement parallel processing, you need to do a bit more than just add that clause. The information is contained in this help topic. A lot of design effort needs to be devoted to ensure that the communication and result merging overhead of the parallel processing does not negate the performance advantage gained by the parallel processing in the first place and that referential integrity of the data is maintained even when some of the parallel tasks fail. Do not under-estimate the complexity of this task.
You could make use of the bgRFC technique. This is a new method of background processing made by SAP.
BgRFC has, in addition to the already existing IN BACKGROUND TASK, the possibility to configure and monitor all calls which run through this method.
You can read more documentation between the different possibilities here. This is all (of course) depending on your SAP version.

How to empty all the queues in Nifi at a time?

I have started exploring NiFi. I have built some flow which is working. But i want to clear all the queues at a time to test the flow each time if i made any changes. I know we can stop and start each processor and test step by step. But i want to know is there a way we can clear all the queues at a time.
the easiest way to stop nifi, delete the following folders, and start it again:
content_repository
database_repository
flowfile_repository
provenance_repository
another approach to use nifi-api to get list of all queues and then call function to empty them.

Does Apache NiFi support batch processing?

I need to know if Apache NiFi supports running processors until completion.
"the execution of a series of processors in process group wait for anothor process group results execution to be complete".
For example:
Suppose there are three processors in NiFi UI.
P1-->P2-->P3
P-->Processor
Now I need to run P1 if it run completely then run P2 And finally it will run like sequence but one wait for another to be complete.
EDIT-1:
Just for example I have data in web URL. I can download that data using GetHTTP Processor. Now I stored that in putFile content. If file saved in putFile directory then run FetchFile to process that file into my database like below workflow.
GetHTTP-->PutFile-->FetchFile-->DB
Is this possible?
NiFi itself is not really a batch processing system, it is a data flow system more geared towards continuous processing. Having said that, there are some techniques you can use to do batch-like operations, depending on which processors you're using.
The Split processors (SplitText, SplitJSON, etc.) write attributes to the flow files that include a "fragment.identifier" which is unique for all splits created from an incoming flow file, and "fragment.count" which is the total number of those splits. Processors like MergeContent use those attributes to process a whole batch (aka fragment), so the output from those kinds of processors would occur after an entire batch/fragment has been processed.
Another technique is to write an empty file in a temp directory when the job is complete, then a ListFile processor (pointing at that temp directory) would issue a flow file when the file is detected.
Can you describe more about the processors in your flow, and how you would know when a batch was complete?

Resources