Can I delete a file in NiFi after sending its messages to Kafka? - apache-nifi

Hi, I'm using NiFi as an ETL tool.
[Image: Process]
This is my current process. I use TailFile to detect a CSV file and then send its contents as messages to Kafka.
It works fine so far, but I want to delete the CSV file after I send its contents to Kafka.
Is there any way to do this?
Thanks

This depends on why you are using TailFile. From the docs,
"Tails" a file, or a list of files, ingesting data from the file as it is written to the file
TailFile is used to get new lines that are added to the same file, as they are written. If you need to tail a file that is being written to, what condition determines that it is no longer being written to?
However, if you are just consuming complete files from the local file system, then you could use GetFile, which gives the option to delete the file after it is consumed.
From a remote file system, you could use ListSFTP and FetchSFTP; FetchSFTP has a Completion Strategy to move or delete the file.
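To make the "consume a complete file, then delete it" idea concrete, here is a minimal Python sketch of the same pattern outside NiFi. It is only an illustration, not NiFi code: `consume_and_delete` is a hypothetical helper, and `send` is a hypothetical callback standing in for whatever publishes a line to Kafka.

```python
import os
from typing import Callable


def consume_and_delete(path: str, send: Callable[[str], None]) -> int:
    """Send every line of a complete file, then delete the file.

    `send` is a hypothetical callback standing in for a Kafka producer.
    The file is removed only if every line was sent without an exception.
    """
    sent = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            send(line.rstrip("\n"))
            sent += 1
    # Reached only when no send() call raised, so the source file
    # survives if publishing fails part-way through.
    os.remove(path)
    return sent
```

Deleting only after the last line has been handed off is the point of the sketch: with a file that is still being appended to, there is no such "last line", which is why the TailFile case is harder.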

Related

Apache NiFi - How to pull all files through the GetSFTP processor only if a particular text file is available, else ignore the files

We will have many files daily, but need to pull them through the GetSFTP processor only if a particular text file is in the list (which indicates all files are ready to pull).
This process involves pulling files from SFTP and copying them to AWS S3.
I know an alternative is to write a script and pull the files through it, but I am looking to achieve the same with processors, without a script.

Filebeat: modify data, enriching JSON from other sources

The log format consists of JSON, encoded one object per line.
Each line is:
{data,payload:/local/path/to/file}
{data,payload:/another/file}
{data,payload:/a/different/file}
The initial idea is to configure Logstash to use the http input and write a Java (or anything else) daemon that gets the log file, parses it line by line, replaces the payload with the content of the referenced file, and sends the data to Logstash.
I can't modify how the server works, so the log format can't be changed.
The Logstash machines are different hosts, so they have no direct access to the files.
Logstash can't mount a shared folder from the server_host.
I can't open any port apart from a single port for Logstash, because the solution has to comply with some silly rules that aren't under my control.
Now, to save some time and have something more reliable than a custom-made solution, is it possible to configure Filebeat to process every line of JSON before sending it to Logstash, so that each line becomes
{data,payload:content_of_the_file}
Filebeat won't be able to do advanced transformations of this kind, as it is only meant to forward logs; it can't even do basic string processing like Logstash does. I suggest you write a custom script that does this transformation and writes the output to a different file.
You can use filebeat to send the contents of this new file to logstash.
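A custom script of the kind suggested above could look roughly like the following Python sketch. It rests on assumptions that the question does not confirm: the sample lines are abbreviated, so the sketch assumes each real line is a valid JSON object with a `payload` key holding a readable file path, and that the referenced files are UTF-8 text. The function names are hypothetical.

```python
import json
from pathlib import Path


def enrich_line(line: str) -> str:
    """Replace the file path stored under "payload" with that file's contents.

    Assumes the line is a JSON object with a "payload" key (an assumption,
    since the sample lines in the question are abbreviated).
    """
    record = json.loads(line)
    record["payload"] = Path(record["payload"]).read_text(encoding="utf-8")
    return json.dumps(record)


def enrich_log(source: str, target: str) -> None:
    """Write an enriched copy of `source` to `target`, one JSON object per line."""
    with open(source, encoding="utf-8") as src, \
            open(target, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                dst.write(enrich_line(line) + "\n")
```

Filebeat would then be pointed at the target file instead of the original log. A one-shot copy like this does not follow a log that keeps growing; a real script would need to track how far it has already read.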

NiFi: How to sync two directories in NiFi

I have to write my response flowfiles to one directory, then get data from it, change it, and put it into another directory. I want to keep these two directories in sync (I mean that whenever I delete or change a flowfile in one directory, it should change in the other directory too). I have more than 10000 flowfiles, so a checklist wouldn't be a good solution. Can you recommend:
any controller service which can help me do this?
any better way I can do this task without a controller service?
You can use a combination of ListFile, FetchFile, and PutFile processors to detect individual file write changes within a file system directory and copy their contents to another directory. This will not detect file deletions, however, so I believe a better solution is to use rsync within an ExecuteProcess processor.
To the best of my knowledge, rsync does not work on HDFS file systems, so in that case I would recommend using a tool like Helix or DistCp (I have not evaluated these tools in particular). You can either invoke them from the "command line" via ExecuteProcess or wrap a client library in an ExecuteScript or custom processor.
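For illustration only, this is what a one-way sync that also propagates deletions (the part the ListFile/FetchFile/PutFile flow misses) looks like as a standalone Python sketch. It is not rsync and not NiFi code; `mirror` is a hypothetical helper, it compares file contents rather than timestamps, and it ignores symlinks and permissions.

```python
import filecmp
import os
import shutil


def mirror(source: str, target: str) -> None:
    """Make `target` match `source`: copy new or changed files, delete extras."""
    os.makedirs(target, exist_ok=True)
    source_names = set(os.listdir(source))

    # Propagate deletions: remove anything in target that is gone from source.
    for name in os.listdir(target):
        if name not in source_names:
            path = os.path.join(target, name)
            if os.path.isdir(path):
                shutil.rmtree(path)
            else:
                os.remove(path)

    # Propagate additions and changes.
    for name in source_names:
        src = os.path.join(source, name)
        dst = os.path.join(target, name)
        if os.path.isdir(src):
            if os.path.isfile(dst):
                os.remove(dst)  # a file was replaced by a directory
            mirror(src, dst)
        else:
            if os.path.isdir(dst):
                shutil.rmtree(dst)  # a directory was replaced by a file
            if not os.path.exists(dst) or not filecmp.cmp(src, dst, shallow=False):
                shutil.copy2(src, dst)
```

Comparing full contents is simple but slow for many large files, which is one reason a dedicated tool such as rsync is the better choice in practice.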

Continuously Combining local file with files downloaded from S3

I have a NiFi flow where I am fetching files from S3. A pair of files is fetched from S3 and later passed into a MergeContent processor. Next, there is a README file that needs to go with each pair of files.
This README file is always the same and I have stored it locally. I have an ExecuteStreamCommand that takes in content from the MergeContent processor.
I have tried passing the README file into the MergeContent processor using the ListFile/FetchFile combination, but it's not working as expected. I guess the final result that I am looking for is a MergeContent package that contains a pair of files downloaded from S3 plus the README file.
I think in this case you will want to use GetFile for the README -- the path is static, and you can set the Keep Source File setting to true in order to constantly retrieve the same content.
ListFile/FetchFile probably isn't working because once ListFile retrieves a filename from the directory, it stores the timestamp in its local state and won't retrieve files older than that on the next execution.
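The effect described above can be reproduced with a deliberately simplified model of a timestamp-based listing. This is not NiFi's implementation; `TimestampLister` is a hypothetical class that only shows why an unchanged file is returned once and then never again.

```python
import os
from typing import List


class TimestampLister:
    """Simplified model of a listing that remembers the newest timestamp seen."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        self.last_seen = float("-inf")  # stands in for the stored state

    def list_new(self) -> List[str]:
        """Return names of files modified after the stored timestamp."""
        entries = []
        for name in sorted(os.listdir(self.directory)):
            path = os.path.join(self.directory, name)
            if os.path.isfile(path):
                entries.append((os.path.getmtime(path), name))
        new = [name for mtime, name in entries if mtime > self.last_seen]
        if entries:
            newest = max(mtime for mtime, _ in entries)
            self.last_seen = max(self.last_seen, newest)
        return new
```

In this model a static README shows up in the first listing only; every later call returns nothing for it until the file is modified, which matches the behaviour the answer attributes to ListFile.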

How to process multiple CSV format files using Spring batch

I am using Spring Batch to process my inbound files. Below is my use case:
I will be receiving a zip containing 15 files in CSV format
I need to process them in parallel
After all files are processed, I need to do some calculation and a report should be sent out.
Could anyone suggest how to implement this using Spring Batch?
I would like to follow the approach below:
Partitioner
Unzip the zip file
For each CSV file, create an ExecutionContext and add it to a Queue for parallel processing.
Reader will be CSV Reader provided by Spring Batch.
Listener will be used to send Report when all processes are done.
Please refer to this one as an example.
If you want something that matches your requirement exactly, please let me know and I can post one for you.
Nghia
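The overall shape of that approach (split the zip into one unit of work per CSV, process the units in parallel, then run a final step once all are done) can be sketched in Python. This is not Spring Batch code and not the example the answer refers to; the function names are hypothetical, and counting data rows stands in for the real per-file processing.

```python
import csv
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor
from typing import Dict


def count_rows(archive_path: str, member: str) -> int:
    """Process one CSV from the zip; here "processing" is counting data rows."""
    # Each task opens its own ZipFile so no file handle is shared across threads.
    with zipfile.ZipFile(archive_path) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text)
            next(reader, None)  # skip the header row
            return sum(1 for _ in reader)


def process_archive(archive_path: str, workers: int = 4) -> Dict[str, int]:
    """Fan out one task per CSV, wait for all of them, then build the report."""
    with zipfile.ZipFile(archive_path) as archive:
        members = [n for n in archive.namelist() if n.lower().endswith(".csv")]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = list(pool.map(lambda m: count_rows(archive_path, m), members))
    # Runs only after every worker has finished, like the listener step.
    return dict(zip(members, counts))
```

The member list plays the role of the partitioner's output, each `count_rows` call corresponds to one partition with its own reader, and the return value is the point where a report would be sent.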
