Apache NiFi - How to pull all files through the GetSFTP processor only if a particular text file is available, else ignore the files - apache-nifi

Many files arrive daily, but they should be pulled through the GetSFTP processor only if a particular text file is present in the listing (which indicates that all files are ready to pull).
This process involves pulling files from SFTP and copying them to AWS S3.
I know an alternative is to write a script and pull them through the script, but I am looking to achieve the same with processors, without a script.
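For reference, the scripted alternative mentioned above might look like this minimal paramiko sketch (hypothetical host, directory, and marker-file names); the question is about achieving the same with processors alone:

import paramiko

HOST, PORT = "sftp.example.com", 22   # placeholder host
REMOTE_DIR = "/daily"                 # placeholder directory
MARKER = "READY.txt"                  # hypothetical "all files ready" indicator

transport = paramiko.Transport((HOST, PORT))
transport.connect(username="user", password="secret")
sftp = paramiko.SFTPClient.from_transport(transport)

names = sftp.listdir(REMOTE_DIR)
if MARKER in names:
    # Marker present: all files are ready, pull everything except the marker.
    for name in names:
        if name != MARKER:
            sftp.get(f"{REMOTE_DIR}/{name}", f"/tmp/{name}")
else:
    print("Marker file not present; skipping this run")

sftp.close()
transport.close()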

Related

Can I delete a file in NiFi after sending messages to Kafka?

Hi, I'm using NiFi as an ETL tool.
This is my current process: I use TailFile to detect a CSV file and then send messages to Kafka.
It works fine so far, but I want to delete the CSV file after I send its contents to Kafka.
Is there any way?
Thanks
This depends on why you are using TailFile. From the docs:
"Tails" a file, or a list of files, ingesting data from the file as it is written to the file
TailFile is used to get new lines that are added to the same file, as they are written. If you need to tail a file that is being written to, what condition determines that it is no longer being written to?
However, if you are just consuming complete files from the local file system, then you could use GetFile, which gives the option to delete the file after it is consumed.
From a remote file system, you could use ListSFTP and FetchSFTP, where FetchSFTP has a Completion Strategy to move or delete the file.
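For illustration, a minimal paramiko sketch (placeholder host and paths) of the pattern that FetchSFTP's "Delete File" Completion Strategy automates, i.e. fetch the complete file and then delete it from the remote host:

import paramiko

transport = paramiko.Transport(("sftp.example.com", 22))  # placeholder host
transport.connect(username="user", password="secret")
sftp = paramiko.SFTPClient.from_transport(transport)

for name in sftp.listdir("/incoming"):
    remote_path = f"/incoming/{name}"
    sftp.get(remote_path, f"/tmp/{name}")  # fetch the complete file
    sftp.remove(remote_path)               # then delete it, as the Completion Strategy would

sftp.close()
transport.close()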

ListFile processor: force the processor to list the full directory every time

My use case:
Some processing somewhere else adds files to a directory (_use_it) -> my flow is called using REST -> now I want my process to read all files from the mentioned directory (_use_it).
I want to read all files from this directory every time, not just changed/added files. I can't start/stop the process; this flow has to run as a background process.
I think I am looking for the ListFile processor to run once, then stop, and then forget its previous state when it runs again. "Some twisted logic" :)
Thanks
1. Using GetFile Processor:
You can use the GetFile processor instead of the ListFile + FetchFile processors; the GetFile processor doesn't store state.
The GetFile processor gets all the files in the directory every time.
Keep Source File property: "If true, the file is not deleted after it has been copied to the Content Repository; this causes the file to be picked up continually and is useful for testing purposes. If not keeping original, NiFi will need write permissions on the directory it is pulling from, otherwise it will ignore the file."
(or)
2. Using ListFile Processor:
Using the NiFi REST API we can clear the state of the ListFile processor, and the processor will then list all the files in the directory every time.
Clear the state of the processor:
POST /processors/{id}/state/clear-requests
Before starting the "list all files in the directory" flow (as sketched below):
Use the REST API to stop the ListFile processor.
Clear the state of the ListFile processor.
Start the ListFile processor.
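A minimal sketch of this stop / clear-state / start sequence, assuming the NiFi 1.x REST API on localhost with no TLS or authentication; the URL and processor id are placeholders:

import requests

NIFI = "http://localhost:8080/nifi-api"        # placeholder NiFi URL
PROC_ID = "REPLACE-WITH-LISTFILE-PROCESSOR-ID" # placeholder processor id

def set_run_status(state):
    # The run-status endpoint needs the processor's current revision.
    entity = requests.get(f"{NIFI}/processors/{PROC_ID}").json()
    requests.put(
        f"{NIFI}/processors/{PROC_ID}/run-status",
        json={"revision": entity["revision"], "state": state},
    ).raise_for_status()

set_run_status("STOPPED")       # 1. stop the ListFile processor
requests.post(
    f"{NIFI}/processors/{PROC_ID}/state/clear-requests"
).raise_for_status()            # 2. clear its state
set_run_status("RUNNING")       # 3. start it again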

NiFi PutFile processor not writing to directory

I am trying to write FlowFiles from a ConsumeAMQP processor to files and attempted to use a PutFile processor, but the files do not end up in the directory. I am new to NiFi, so I am not sure how to debug this, but I did get an idea online to use a Funnel and see where the FlowFiles end up. They do end up in the success relationship, so it indicates to me that the files make it to the processor but are not being written to the output directory. I also do not know how to specify the filenames or file extension, or how to write a specific number of FlowFiles to a file at a time.
The settings are below:
These are the processors in the flow:
Could someone please advise?
This shows the actual PutFile read/write count, but no out count and no errors:

NiFi: How to sync two directories in NiFi

I have to write my response FlowFiles into one directory, then get data from it, change it, and then put it inside another directory. I want to keep these two directories in sync (I mean that whenever I delete or change a FlowFile in one directory, it should change in the other directory too). I have more than 10000 FlowFiles, so a checklist wouldn't be a good solution. Can you recommend:
any controller service which can help me do this?
any better way I can do this task without a controller service?
You can use a combination of the ListFile, FetchFile, and PutFile processors to detect individual file write changes within a file system directory and copy their contents to another directory. This will not detect file deletions, however, so I believe a better solution is to use rsync within an ExecuteProcess processor.
To the best of my knowledge, rsync does not work on HDFS file systems, so in that case I would recommend using a tool like Helix or DistCp (I have not evaluated these tools in particular). You can either invoke them from the command line via ExecuteProcess or wrap a client library in an ExecuteScript or custom processor.
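For example, the ExecuteProcess processor could be configured to run an rsync command like the following (placeholder paths; wrapped in Python here only for illustration):

import subprocess

# One-way sync: propagates new/changed files AND deletions from
# the source directory to the target directory.
subprocess.run(
    ["rsync",
     "-a",        # archive mode: recurse, preserve times/permissions
     "--delete",  # remove target files that no longer exist in source
     "/data/source/",
     "/data/target/"],
    check=True,
)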

Spring Batch job starts processing a file not fully uploaded to the SFTP server

I have a spring-batch job scanning the SFTP server at a given interval. When it finds a new file, it starts processing it.
This works fine for most cases, but there is one case where it doesn't:
The user starts uploading a new file to the SFTP server
The batch job checks the server and finds a new file
It starts processing it
But since the file is still being uploaded, the processing encounters an unexpected end of input block, and an error occurs.
How can I check that the file was fully uploaded to the SFTP server before batch job processing starts?
Locking files while uploading / Upload to temporary file name
You may have an automated system monitoring a remote folder, and you want to prevent it from accidentally picking up a file that has not finished uploading yet. As the majority of SFTP and FTP servers (WebDAV being an exception) do not support file locking, you need to prevent the automated system from picking up the file by other means.
Common workarounds are:
Upload a "done" file once the upload of the data files finishes, and have the automated system wait for the "done" file before processing the data files. This is an easy solution, but won't work in a multi-user environment.
Upload the data files to a temporary ("upload") folder and move them atomically to the target folder once the upload finishes.
Upload the data files under a distinct temporary name, e.g. with a .filepart extension, and rename them atomically once the upload finishes. Have the automated system ignore the .filepart files.
Got from here
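A minimal sketch combining two of the workarounds above (paramiko, with placeholder host, directory, and marker name): wait for the "done" marker, and skip any in-progress .filepart uploads:

import time
import paramiko

transport = paramiko.Transport(("sftp.example.com", 22))  # placeholder host
transport.connect(username="user", password="secret")
sftp = paramiko.SFTPClient.from_transport(transport)

REMOTE_DIR = "/upload"   # placeholder directory
MARKER = "done"          # hypothetical "upload finished" marker file

while MARKER not in sftp.listdir(REMOTE_DIR):
    time.sleep(30)       # poll until the uploader drops the marker file

for name in sftp.listdir(REMOTE_DIR):
    if name == MARKER or name.endswith(".filepart"):
        continue         # ignore the marker and partial uploads
    sftp.get(f"{REMOTE_DIR}/{name}", f"/tmp/{name}")

sftp.close()
transport.close()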
We had a similar problem. Our solution: we configured the spring-batch cron trigger to trigger the job every 10 minutes (though we could have configured it for 5 minutes, as the file transfer was taking less than 3 minutes), and then we read/process all the files created more than 10 minutes ago. We assume the FTP operation completes within 3 minutes. This gave us some additional flexibility, such as when the spring-batch app was down, etc.
For example, if the batch job triggered at 10:20 AM, we read all the files that were created before 10:10 AM; likewise, the job that runs at 10:30 reads all the files created before 10:20.
Note: once read, you need to either delete the files or move them to a history folder to avoid duplicate reads.
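The cutoff logic described above might look like this sketch (paramiko with placeholder host and paths, not the poster's actual Spring Batch code): at each run, process only files whose remote modification time is older than the 10-minute window, then move them to a history folder to avoid duplicate reads:

import time
import paramiko

WINDOW_SECONDS = 10 * 60   # the 10-minute trigger interval

transport = paramiko.Transport(("sftp.example.com", 22))  # placeholder host
transport.connect(username="user", password="secret")
sftp = paramiko.SFTPClient.from_transport(transport)

cutoff = time.time() - WINDOW_SECONDS
for attr in sftp.listdir_attr("/inbox"):                  # placeholder directory
    if attr.st_mtime < cutoff:                            # upload assumed complete by now
        sftp.get(f"/inbox/{attr.filename}", f"/tmp/{attr.filename}")
        sftp.rename(f"/inbox/{attr.filename}",
                    f"/history/{attr.filename}")          # prevent duplicate reads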
