NiFi: How to sync two directories in NiFi - apache-nifi

I have to write my response flowfiles to one directory, then get data from it, change it, and put it into another directory. I want to keep these two directories in sync (I mean that whenever I delete or change a flowfile in one directory, it should change in the other directory too). I have more than 10,000 flowfiles, so a checklist wouldn't be a good solution. Can you recommend:
any controller service which can help me do this?
any better way to accomplish this task without a controller service?

You can use a combination of ListFile, FetchFile, and PutFile processors to detect individual file write changes within a file system directory and copy their contents to another directory. However, this will not detect file deletions, so I believe a better solution is to use rsync within an ExecuteProcess processor.
To the best of my knowledge, rsync does not work on HDFS file systems, so in that case I would recommend a tool like Helix or DistCp (I have not evaluated these tools in particular). You can either invoke them from the command line via ExecuteProcess or wrap a client library in an ExecuteScript or custom processor.
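For the local file system case, here is a rough sketch of the kind of rsync invocation you might configure in ExecuteProcess (via its Command and Command Arguments properties). The directories below are placeholders; the command is wrapped in a Python subprocess call only to spell out the flags:

import subprocess

# Hypothetical directories; trailing slashes make rsync copy the
# directory contents rather than the directory itself.
SRC = "/data/nifi/output/"
DST = "/data/nifi/mirror/"

# -a preserves attributes and recurses; --delete removes files from the
# target that no longer exist in the source, which is the part a plain
# ListFile/FetchFile/PutFile flow cannot do.
subprocess.run(["rsync", "-a", "--delete", SRC, DST], check=True)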

Related

Apache NiFi - How to pull all files thru GetSFTP processor only if a particular text file is available else ignore the files

I will be having many files daily, but I need to pull them only if a particular text file is in the list (which indicates all files are ready to pull), through the GetSFTP processor.
This process involves pulling files from SFTP and copying them to AWS S3.
I know an alternative is to write a script and pull them through the script, but I am looking to achieve the same with processors, without a script.

Can I delete a file in NiFi after sending messages to Kafka?

Hi, I'm using NiFi as an ETL tool.
[image: current process]
This is my current process. I use TailFile to detect a CSV file and then send messages to Kafka.
It works fine so far, but I want to delete the CSV file after I send the contents of the CSV to Kafka.
Is there any way?
Thanks
This depends on why you are using TailFile. From the docs,
"Tails" a file, or a list of files, ingesting data from the file as it is written to the file
TailFile is used to get new lines that are added to the same file, as they are written. If you need to tail a file that is being written to, what condition determines that it is no longer being written to?
However, if you are just consuming complete files from the local file system, then you could use GetFile which gives the option to delete the file after it is consumed.
From a remote file system, you could use ListSFTP and FetchSFTP, which have a Completion Strategy to move or delete the file.

ListFile processor, force processor to list full directory every time

My use case:
Some processing somewhere else adds files to some dir (_use_it) -> my flow is called using REST -> now I want my process to read all files from the mentioned directory (_use_it).
I want to read all files from this directory every time, not just changed/added files. I can't start/stop the process manually; this flow has to run as a background process.
I think I am looking for the ListFile processor to run once, then stop, and then when it runs again, forget its previous state. "Some twisted logic" :)
Thanks
1. Using GetFile Processor:
You can use the GetFile processor instead of the ListFile + FetchFile processors; the GetFile processor doesn't store state and gets all the files in the directory every time.
Keep Source File property: If true, the file is not deleted after it has been copied to the Content Repository; this causes the file to be picked up continually and is useful for testing purposes. If not keeping the original, NiFi will need write permissions on the directory it is pulling from, otherwise it will ignore the file.
(or)
2. Using ListFile Processor:
Making use of the NiFi REST API, we can clear the state of the ListFile processor, and the processor will then list all files in the directory every time.
Clear the state of the processor:
POST /processors/{id}/state/clear-requests
Before starting the "list all files in the directory" flow:
1. Use the REST API to stop the ListFile processor.
2. Clear the state of the ListFile processor.
3. Start the ListFile processor.
Refer to this and this for how to stop the processor via the REST API.
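A minimal sketch of those three calls in Python, assuming an unsecured NiFi at localhost:8080 and a placeholder processor id. The run-status endpoint used here exists in recent NiFi 1.x releases; on older versions you update the full processor entity instead:

import requests

NIFI_API = "http://localhost:8080/nifi-api"   # assumed NiFi instance
PROCESSOR_ID = "your-listfile-processor-id"   # placeholder

def set_run_status(state):
    # Every state change must echo back the processor's current revision.
    revision = requests.get(f"{NIFI_API}/processors/{PROCESSOR_ID}").json()["revision"]
    resp = requests.put(
        f"{NIFI_API}/processors/{PROCESSOR_ID}/run-status",
        json={"revision": revision, "state": state},
    )
    resp.raise_for_status()

# 1. Stop the ListFile processor
set_run_status("STOPPED")

# 2. Clear its state so the next run lists the whole directory again
requests.post(
    f"{NIFI_API}/processors/{PROCESSOR_ID}/state/clear-requests"
).raise_for_status()

# 3. Start it again
set_run_status("RUNNING")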

Elasticsearch - how to store scripts in config/scripts directory

I'm trying to experiment with using scripts in the config/scripts directory. The Elasticsearch docs here say this:
Save the contents of the script as a file called config/scripts/my_script.groovy on every data node in the cluster:
This seems like it's probably really easy, but I'm afraid I don't understand how exactly to put a Groovy file "on every data node in the cluster". Would this normally be done through the command line somehow, or can it be done by manually moving the Groovy file (in Finder on OS X, for example)? I have a test index, but when I look at the file structure on the nodes I'm confused about where to put the Groovy file. Help, pretty please.
You just need to copy the file to each server running Elasticsearch. If you're just running Elasticsearch on your computer, go to the folder you've installed Elasticsearch into and copy the file into config/scripts there (you may have to create the folder first). It doesn't matter how the file gets there.
You should see an entry in the logs (or the console, if you are running in the foreground) along the lines of:
compiling script file [/path/to/elasticsearch/config/scripts/my_script.groovy]
This won't show up straight away - by default Elasticsearch checks for new/updated scripts every 60 seconds (you can change this with the watcher.interval setting).
Since file scripts are deprecated (elastic/elasticsearch#24552 & elastic/elasticsearch#24555), this approach is not going to work anymore.
The API is the only way.
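For example, with the stored-scripts API you keep the script in cluster state instead of copying a file to every node. This sketch uses Elasticsearch 6.x+ syntax; the index, field, and script names are made up:

import requests

ES = "http://localhost:9200"   # assumed local cluster

# Store the script once under an id; it lives in cluster state,
# so there is no file to copy onto every data node.
requests.put(
    f"{ES}/_scripts/my_script",
    json={"script": {"lang": "painless",
                     "source": "doc['price'].value * params.factor"}},
).raise_for_status()

# Reference the stored script by id in a query.
query = {
    "query": {
        "function_score": {
            "query": {"match_all": {}},
            "script_score": {
                "script": {"id": "my_script", "params": {"factor": 1.2}}
            },
        }
    }
}
print(requests.post(f"{ES}/my_index/_search", json=query).json())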

Locking of file by specific set of processes

I am facing a scenario where I have to allow access to a file for multiple instances of the same executable, but deny access to the file to all other executables.
For example, if I have a file foo.txt and an executable proc.exe, then any number of proc.exe instances should be able to access and modify foo.txt, but no other process should be able to access or modify this file.
You can't do this based directly on which executable a process is running. However, you can make your processes co-operate with one another, so that the only processes that can access the file are those that know how to do it.
One particularly simple approach would be to create a named file mapping object for the file using CreateFileMapping(). Only processes that know the name of the file mapping would be able to access it. However, you would then only be able to access the file via memory mapping, not via normal I/O functions.
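To make the named-mapping idea concrete, here is a rough ctypes sketch (Windows only). The mapping name and file name are made up, error handling is minimal, and both "processes" are shown in one listing for brevity; it is a sketch of the approach, not a hardened implementation:

import ctypes
import ctypes.wintypes as wt
import msvcrt

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)

PAGE_READWRITE = 0x04
FILE_MAP_ALL_ACCESS = 0x000F001F
MAPPING_NAME = "Local\\FooTxtMapping"   # the shared "secret" name

kernel32.CreateFileMappingW.restype = wt.HANDLE
kernel32.CreateFileMappingW.argtypes = [wt.HANDLE, ctypes.c_void_p, wt.DWORD,
                                        wt.DWORD, wt.DWORD, wt.LPCWSTR]
kernel32.OpenFileMappingW.restype = wt.HANDLE
kernel32.OpenFileMappingW.argtypes = [wt.DWORD, wt.BOOL, wt.LPCWSTR]
kernel32.MapViewOfFile.restype = ctypes.c_void_p
kernel32.MapViewOfFile.argtypes = [wt.HANDLE, wt.DWORD, wt.DWORD,
                                   wt.DWORD, ctypes.c_size_t]

# --- In the proc.exe instance that owns foo.txt ---
# Open the (non-empty) file and publish a *named* file-mapping object for it.
f = open("foo.txt", "r+b")
hfile = msvcrt.get_osfhandle(f.fileno())
hmap = kernel32.CreateFileMappingW(hfile, None, PAGE_READWRITE, 0, 0, MAPPING_NAME)
if not hmap:
    raise ctypes.WinError(ctypes.get_last_error())

# --- In any other co-operating proc.exe instance ---
# Only a process that knows MAPPING_NAME can reach the data this way;
# it never needs to open foo.txt itself.
hmap2 = kernel32.OpenFileMappingW(FILE_MAP_ALL_ACCESS, False, MAPPING_NAME)
view = kernel32.MapViewOfFile(hmap2, FILE_MAP_ALL_ACCESS, 0, 0, 0)
first_bytes = ctypes.string_at(view, 16)   # read the first 16 bytes of foo.txt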
DuplicateHandle() provides another option, but because the duplicated handle shares a single file object you need to be very careful how you use it. Overlapped I/O is probably the safest approach, as it explicitly supports multiple simultaneous operations on the same object.
