Watch new files in directory with NiFi - apache-nifi

I have a use case where new files arrive every day at different times, e.g. every hour or two. I need to watch a directory in my folder, and when new files are added, trigger an event that sends the paths of those new files to my web service from NiFi. Any idea how to implement this and which tool to use?
Or maybe this is not the best approach?

Take a look at the ListFile and FetchFile processors:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.ListFile/index.html
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.FetchFile/index.html
Complete NiFi documentation can be found at https://nifi.apache.org/docs.html

If your files are on the local file system, use the 'GetFile' processor, which picks up new files added to the configured 'Input Directory' and immediately feeds the data into NiFi without any delay.
If your requirement is to schedule it, e.g. every hour or at a specific time, use the 'Scheduling' tab in the processor's configuration and pick the 'CRON driven' strategy. NiFi uses Quartz-style cron expressions (six or seven fields, starting with seconds), so "every hour on the hour" looks like this
0 0 * * * ?
If your files are in an S3 bucket then you have to use SQS queue notifications with the 'GetSQS' processor, documented in detail at the links below
http://crazyslate.com/apache-nifi-intergration-with-aws-s3/
https://community.hortonworks.com/content/idea/76919/how-to-integrate-aws-sqs-queue-with-nifi.html

Related

How to make an event-based flow run only once in a while

I have made a flow that triggers when a file is created in a folder (let's call it the event folder). Based on that, the flow creates another file in a different folder and sends me a message that a new file has been created.
Now the event folder could have one or multiple files generated at once. My flow triggers for each file created and spams me for all of the files at once. I want only one message for any number of files created within a span of 5 minutes. Is there a way to do that?
Here is my flow
Looks like you'll need to store a 'new files created' status in some variable (possible hints at https://learn.microsoft.com/power-automate/create-variable-store-values ).
Then create another flow, scheduled to run every 5 minutes (https://learn.microsoft.com/power-automate/run-scheduled-tasks), that checks the status, optionally sends a message, and resets the status to 'nothing to do'.
Using the When a file is created (properties only) trigger with the Split On setting turned off, Concurrency Control turned on, and Degree of Parallelism set to 1 does the trick
See attached image below

NiFi: how to hold flow files until a downstream process is finished

I am designing a data ingestion pattern using NiFi. One process needs to stop releasing flow files until a downstream process has finished processing. I tried to use Wait and Notify and have not had any success. I am hoping the queue size and back pressure can be set across a few processors.
Similarly, is there a way to implement the logic: don't allow flow files in while one is currently being processed across multiple processors?
Any help is appreciated
You need a combination of MonitorActivity and ExecuteStreamCommand (running a Python "nipyapi" script).
I have a similar requirement in one of my working flows.
You will need to install the Python library nipyapi first and create this script on the NiFi box.
from time import sleep
import nipyapi
nipyapi.utils.set_endpoint('http://ipaddress:port/nifi-api', ssl=False, login=False)
## Get PG ID using the PG Name
mypg = nipyapi.canvas.get_process_group('start')
nipyapi.canvas.schedule_process_group(mypg.id, scheduled=True) ## Start
sleep(1)
nipyapi.canvas.schedule_process_group(mypg.id, scheduled=False) ## Stop
I will put the template in the image at the link below; see the configuration of the MonitorActivity processor - it will generate a flow file if no activity happens for 10 seconds (you can play with the timings, though).
Download template
Note: this is not a very good approach if you have tight latency requirements.
Another idea would be to monitor the aggregate queue of the entire flow and, once the queue is zero, restart the start flow (this would be very intensive if you have a lot of connections).
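If you want to automate that queue check from outside NiFi, below is a rough sketch against the standard NiFi REST endpoints, written in plain Java only to keep it self-contained. Treat everything in it as an assumption: it presumes an unsecured NiFi instance, the host/port and the id of the 'start' process group are placeholders, and the string parsing is deliberately naive to avoid a JSON dependency. The same two calls are available through nipyapi if you would rather extend the Python script above.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QueueWatcher {

    private static final String NIFI = "http://ipaddress:port/nifi-api"; // placeholder, same as in the script above
    private static final String START_PG_ID = "your-start-pg-uuid";      // placeholder: id of the 'start' process group

    private static final HttpClient HTTP = HttpClient.newHttpClient();
    private static final Pattern QUEUED = Pattern.compile("\"flowFilesQueued\"\\s*:\\s*(\\d+)");

    public static void main(String[] args) throws Exception {
        while (true) {
            if (flowFilesQueued() == 0) {
                // Nothing queued anywhere in the flow: kick the 'start' group once, then stop it again.
                scheduleProcessGroup(START_PG_ID, "RUNNING");
                Thread.sleep(1000);
                scheduleProcessGroup(START_PG_ID, "STOPPED");
            }
            Thread.sleep(5000);
        }
    }

    // Reads the aggregate queued-flowfile count for the whole canvas (root process group).
    private static long flowFilesQueued() throws Exception {
        HttpRequest req = HttpRequest.newBuilder(
                URI.create(NIFI + "/flow/process-groups/root/status")).GET().build();
        String body = HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
        Matcher m = QUEUED.matcher(body);
        return m.find() ? Long.parseLong(m.group(1)) : -1;
    }

    // Starts or stops every component in the given process group.
    private static void scheduleProcessGroup(String pgId, String state) throws Exception {
        String json = "{\"id\":\"" + pgId + "\",\"state\":\"" + state + "\"}";
        HttpRequest req = HttpRequest.newBuilder(URI.create(NIFI + "/flow/process-groups/" + pgId))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
        HTTP.send(req, HttpResponse.BodyHandlers.ofString());
    }
}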
I was able to design a solution within NiFi. Essentially, GenerateFlowFile is used as a signal (run only once ever). The trick is to have the newly generated flow file merge with the original input flow through defragmentation; every time the flow has finished, the success output is able to merge with the next input flow file.
Solution Flow

How to architect file processing in Laravel

I have a task to observe a folder where files arrive over SFTP. The files are big and processing one file is relatively time-consuming. I am looking for the best approach. Here are some ideas, but I am not sure which is the best way.
1. Run a scheduler every 5 minutes to check for new files.
2. For each new file, trigger an event that there is a new file.
3. Create a listener that listens for this event and uses queues. In the listener, copy the new file into a processing folder and process it. When processing of a new file starts, insert a record in the DB with status 'processing'. When processing is done, change the record status and copy the file to a processed folder.
In this solution I have two copy operations for each file. This is because, if a second scheduler run executes before all files are processed, some files could overlap between two processing jobs.
What is the best way to do it? Should I use another approach to avoid the two copy operations? Something like adding a database check during scheduler execution to see if a file is already in the processing state?
You should use ->withoutOverlapping(), as stated in the Task Scheduler section of the manual here.
Using this you make sure that only one instance of the task runs at any given time.

How to delete input files after successful mapreduce

We have a system that receives archives in a specified directory and, on a regular basis, launches a MapReduce job that opens the archives and processes the files within them. To avoid re-processing the same archives the next time, we hook into the close() method of our RecordReader to have the archive deleted after the last entry is read.
The problem with this approach (we think) is that if a particular mapping fails, the next mapper that makes another attempt at it finds that the original file has already been deleted by the record reader from the first attempt, and it bombs out. We think the way to go is to hold off until all the mapping and reducing is complete and then delete the input archives.
Is this the best way to do this?
If so, how can we obtain a listing of all the input files found by the system from the main program? (we can't just scrub the whole input dir, new files may be present)
i.e.:
. . .
job.waitForCompletion(true);
// we're done, delete input files, how?
return 0;
}
Couple comments.
I think this design is heartache-prone. What happens when you discover that someone deployed a messed up algorithm to your MR cluster and you have to backfill a month's worth of archives? They're gone now. What happens when processing takes longer than expected and a new job needs to start before the old one is completely done? Too many files are present and some get reprocessed. What about when the job starts while an archive is still in flight? Etc.
One way out of this trap is to have the archives go to a rotating location based on time, and either purge the records yourself or (in the case of something like S3) establish a retention policy that allows a certain window for operations. Also, whatever the back-end MapReduce processing is doing could be made idempotent: processing the same record twice should be no different from processing it once. Something tells me that if you're reducing your dataset, that property will be difficult to guarantee.
At the very least you could rename the files you processed instead of deleting them right away and use a glob expression to define your input that does not include the renamed files. There are still race conditions as I mentioned above.
You could use a queue such as Amazon SQS to record the delivery of an archive, and your InputFormat could pull these entries rather than listing the archive folder when determining the input splits. But reprocessing or backfilling becomes problematic without additional infrastructure.
All that being said, the list of splits is generated by the InputFormat. Write a decorator around that and you can stash the split list wherever you want for use by the master after the job is done.
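To make that decorator idea concrete, here is a rough sketch; the class name and the stash location are invented for illustration. It extends the stock TextInputFormat, lets the parent compute the splits as usual, and writes the paths behind those splits to a side file in HDFS that the driver can read back once job.waitForCompletion(true) returns.

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Wraps the normal split calculation and records which files ended up in the splits,
// so the driver can delete exactly those files after the job has finished successfully.
public class RecordingTextInputFormat extends TextInputFormat {

    public static final String SPLIT_LIST_PATH = "/tmp/job-input-files.txt"; // hypothetical stash location

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        List<InputSplit> splits = super.getSplits(context);

        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (FSDataOutputStream out = fs.create(new Path(SPLIT_LIST_PATH), true)) {
            for (InputSplit split : splits) {
                if (split instanceof FileSplit) {
                    out.writeBytes(((FileSplit) split).getPath().toString() + "\n");
                }
            }
        }
        return splits;
    }
}

Register it with job.setInputFormatClass(RecordingTextInputFormat.class); the driver then deletes the files listed in the side file, and only after the job has succeeded.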
The simplest way would probably be to do a multiple-input job: read the directory for the files before you run the job and pass that list, instead of a directory, to the job (then delete the files in the list after the job is done).
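A minimal driver sketch of that approach (the paths, the glob pattern and the job setup below are placeholders, not taken from your code): snapshot the input directory once, feed exactly that list of files to the job, and delete the same list only if the job succeeded.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ArchiveDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Snapshot the input directory once, up front; only these files take part in the job.
        FileStatus[] archives = fs.globStatus(new Path("/incoming/archives/*.zip"));
        if (archives == null || archives.length == 0) {
            System.out.println("Nothing to process");
            return;
        }

        Job job = Job.getInstance(conf, "process-archives");
        job.setJarByClass(ArchiveDriver.class);
        // ... mapper/reducer/output key-value setup goes here ...

        for (FileStatus archive : archives) {
            FileInputFormat.addInputPath(job, archive.getPath());
        }
        FileOutputFormat.setOutputPath(job, new Path("/processed/output"));

        boolean ok = job.waitForCompletion(true);

        // Delete exactly the files we listed, and only after the whole job has succeeded.
        if (ok) {
            for (FileStatus archive : archives) {
                fs.delete(archive.getPath(), false);
            }
        }
        System.exit(ok ? 0 : 1);
    }
}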
Based on the situation you are explaining, I can suggest the following solution:
1. The monitoring of the data, i.e. watching the directory into which the archives land, should be done by a separate process. That process can use a metadata table, for example in MySQL, to record status entries based on what it sees in the directories. The metadata entries can also be checked for duplicates.
2. Based on the metadata entries, another process can handle triggering the MapReduce job. A status flag in the metadata can be checked to decide when to trigger the jobs.
I think you should use Apache Oozie to manage your workflow. From Oozie's website (bolding is mine):
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
...
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.

Oozie/Hadoop: How do I define an input dataset when it's more complex than just a static file?

I'm trying to run an existing Hadoop job using Oozie (I'm migrating from AWS).
In AWS MapReduce I submit jobs programmatically, so before the job is submitted my code programmatically finds the input.
My input happens to be the output of the last SUCCESSFUL run of another job. To find the last SUCCESSFUL run I need to scan an HDFS folder, sort by the timestamp embedded in the folder naming convention, and find the most recent folder with a _SUCCESS file in it.
How to do this is beyond my oozie-newbie comprehension.
Can someone simply describe for me what I need to configure in Oozie so I have some idea of what I'm attempting to reach for here?
Take a look at the following Oozie coordinator configuration: https://github.com/cloudera/cdh-twitter-example/blob/master/oozie-workflows/coord-app.xml
There is a tag called "done-flag"; there you can put the _SUCCESS file name so that a workflow, or in your case a MapReduce job, is only triggered once that file exists. There are also parameters for scheduling the job, for example:
${coord:current(1 + (coord:tzOffset() / 60))}
....
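For orientation only, here is a stripped-down coordinator sketch in the style of that example (every name, path and date below is made up; the two pieces doing the work are the done-flag, which makes a dataset instance count as available only once _SUCCESS exists, and coord:latest(0), which resolves to the most recent available instance):

<coordinator-app name="my-coord" frequency="${coord:hours(1)}"
                 start="2018-01-01T00:00Z" end="2020-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="upstream-output" frequency="${coord:hours(1)}"
             initial-instance="2018-01-01T00:00Z" timezone="UTC">
      <!-- the folder naming convention with the embedded timestamp -->
      <uri-template>hdfs:///data/upstream/${YEAR}${MONTH}${DAY}${HOUR}</uri-template>
      <!-- an instance only counts as available once _SUCCESS exists -->
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="upstream-output">
      <instance>${coord:latest(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///apps/my-workflow</app-path>
      <configuration>
        <property>
          <name>inputDir</name>
          <value>${coord:dataIn('input')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>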
