I would like to set up a FileWatcher job that looks for multiple signal files to be present before it kicks off. Is there any way to check for the presence of multiple files before the child job is triggered? Or would I have to create multiple file watcher jobs, one for each file?
File Trigger (R11) or File Watcher (legacy) jobs can only take one value in the watch_file attribute. R11 does allow wildcards, but that is probably not what you want. I would create a separate job for each signal file, put those jobs in a box, and run the downstream job on the success of the box.
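A rough JIL sketch of that layout (the job names, machine name, file paths and intervals below are all made up for illustration; adjust them to your environment):

/* box that succeeds only once both signal files have arrived */
insert_job: signal_box    job_type: box

insert_job: watch_sig_a   job_type: fw
box_name: signal_box
machine: myhost
watch_file: /data/inbound/signal_a.done
watch_interval: 60

insert_job: watch_sig_b   job_type: fw
box_name: signal_box
machine: myhost
watch_file: /data/inbound/signal_b.done
watch_interval: 60

/* downstream job is released only on success of the whole box */
insert_job: load_data     job_type: cmd
machine: myhost
command: /apps/scripts/load_data.sh
condition: s(signal_box)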
I am new to NiFi. My requirement is to trigger a NiFi process group from an external scheduling tool called Control-M. I tried using a shell script to start and stop the process group with curl commands. The process group fetches data from a text file and writes it into a database, but I am unable to determine when the process group has completed, because I only see statuses like Started, Running and Stopped, never a Completed state. I am stuck on this issue and need your valuable inputs on how to determine that all the records have been inserted into the database by the process group.
NiFi is not a batch 'start & stop' style tool. NiFi is built to work with continuous streams of data, meaning that flows are 'always on'. It is not intended to be used with batch schedulers like ControlM, Oozie, Airflow, etc. As such, there is no 'Completed' status for a flow.
That said, if you want to schedule flows in this way, it is possible, but you need to build it into the flow yourself. You will need to define what 'Completed' means and build that logic into your flow - e.g. a MonitorActivity processor after your last processor to watch for activity.
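If you do go down the scheduler route, one common proxy for 'completed' on the Control-M side is to poll the process group's status over NiFi's REST API until its queues drain. A rough curl/jq sketch; the host, port, the <pg-id> placeholder and the "empty queue means done" assumption are all things you would adapt to your flow:

# start all components in the process group
curl -s -X PUT -H 'Content-Type: application/json' \
  -d '{"id":"<pg-id>","state":"RUNNING"}' \
  http://localhost:8080/nifi-api/flow/process-groups/<pg-id>

# poll until nothing is queued inside the group any more
while true; do
  queued=$(curl -s http://localhost:8080/nifi-api/flow/process-groups/<pg-id>/status \
    | jq -r '.processGroupStatus.aggregateSnapshot.flowFilesQueued')
  [ "$queued" = "0" ] && break
  sleep 10
done

# stop the group again so the next scheduled run starts from a clean state
curl -s -X PUT -H 'Content-Type: application/json' \
  -d '{"id":"<pg-id>","state":"STOPPED"}' \
  http://localhost:8080/nifi-api/flow/process-groups/<pg-id>

Note that an empty queue right after start does not prove the source processor ever ran, so in practice you would combine this with something like the MonitorActivity idea above, or verify record counts in the target database.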
I am working on CakePHP 3.4 project.
I have to execute some command to scan through the files and directories of a particular directory.
This might take a long time depending on the size of the directory, so I want to run it in the background and show a "running" label in the view until it has finished successfully.
How can I run a Shell task in the background from a Controller and update the database once it has executed?
I'm new to Shell tasks.
You're thinking along the right lines about running this in the background, since it is a time-consuming task. You will need some form of queuing system that lets you add jobs to a queue, which then get run in the background by a worker launched from a cronjob. Take a look at the Queue plugin for this.
You'll basically need to create a queue task that contains the functionality you need to run in the background, and then add a job to the queue that will run that task. The Queue plugin's documentation shows how to do this, and there are plenty of example queue tasks included with the plugin.
If you need to indicate the status of the queued job you could save the job's ID in a session and check if it is complete when loading a page.
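A rough sketch of what queueing the scan could look like from a controller, assuming the Queue plugin's QueuedJobs table and its createJob() method as described in its docs, and a queue task you have written called DirectoryScan (the task name, data keys and path below are made up):

// in a controller action (sketch; adjust to the Queue plugin version you use)
$this->loadModel('Queue.QueuedJobs');

// push a background job onto the queue; the plugin's worker (run from cron,
// e.g. bin/cake queue runworker) will pick it up and execute the DirectoryScan task
$job = $this->QueuedJobs->createJob('DirectoryScan', [
    'path' => '/var/www/uploads',   // hypothetical directory to scan
]);

// remember the job id so a later request can check whether it has finished
$this->request->session()->write('scanJobId', $job->id);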
You can dispatch a Shell task from the controller. If you want to run this in the background you could, for example, run this controller action via JavaScript/Ajax.
use Cake\Console\ShellDispatcher; // at the top of the controller file

// maybe this task runs looooong
set_time_limit(0);

// dispatch the shell command; run() returns the shell's exit code
$shell = new ShellDispatcher();
$output = $shell->run(['cake', 'bake', 'model', 'Products']);

if ($output === 0) {
    $this->Flash->success('Yep!');
} else {
    $this->Flash->error('Nope!');
}
But you could indeed have googled this at least. ;-)
EDIT Forget this one, go for drmonkeyninja’s answer.
Say I want to run a job on the cluster: job1.m
Slurm handles the batch jobs and I'm loading Mathematica to save the output file job1.csv
I submit job1.m and it is sitting in the queue. Now, I edit job1.m to have different variables and parameters, and tell it to save data to job1_edited.csv. Then I re-submit job1.m.
Now I have two batch jobs in the queue.
What will happen to my output files? Will job1.csv be data from the original job1.m file? And will job1_edited.csv be data from the edited file? Or will job1.csv and job1_edited.csv be the same output?
:(
Thanks in advance!
I am assuming job1.m is a Mathematica job, run from inside a Bash submission script. In that case, job1.m is read when the job starts, so if it is modified after submission but before the job starts, the modified version will run. If it is modified after the job starts, the original version will run.
If job1.m is the submission script itself (so you run sbatch job1.m), that script is copied into a spool directory specific to the job, so if it is modified after the job is submitted, the original version will still run.
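To make the distinction concrete, here is a minimal submission-script sketch (the module name and file names are assumptions):

#!/bin/bash
#SBATCH --job-name=job1
#SBATCH --output=job1_%j.log
# This submission script is copied to Slurm's spool directory at sbatch time,
# so editing it after submission does not change the queued job.
# job1.m, by contrast, is only read when the job starts: edits made while the
# job is still pending WILL be picked up.
module load mathematica   # assumption: your cluster provides such a module
math -script job1.m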
In any case, for reproducibility and traceability, it is better to use a workflow manager such as Fireworks or Bosco.
We're using the Ruby gem whenever to manage large batches of import jobs. But what if a file is still being imported when the next cron job occurs?
For example:
12am: whenever starts an import cron job for import.csv
2am: import.csv is still being imported, but the next cron job scheduled by whenever kicks off.
Would whenever skip that file or try to run it again? Any suggestions to make sure it doesn't try to process the same file twice?
Whenever is merely a frontend for the crontab. Whenever doesn't actually launch any of the processes, it writes a crontab that handles the actual scheduling and launching. Whenever cannot do what you're asking.
The crontab cannot do what you want either. It launches the process and that's it.
You need to implement the check yourself in the process launched by cron. A common way of doing this is a lockfile, and I'm sure there are libraries for it (e.g. http://rubygems.org/gems/lockfile).
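For example, a minimal guard using Ruby's built-in File#flock instead of a gem (the lock path and the import_csv call are just placeholders for your own import code):

# at the top of the import script/rake task launched by cron
lock = File.open('/tmp/import.lock', File::RDWR | File::CREAT)
unless lock.flock(File::LOCK_EX | File::LOCK_NB)
  puts 'Previous import still running, skipping this run'
  exit
end

import_csv('import.csv')   # hypothetical import entry point

lock.flock(File::LOCK_UN)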
Depending on your situation you might be able to create other checks before launching the import.
Well, this isn't really an issue with whenever itself.
However, you could rename the file you want to import when you start processing (12am to 2am is a reasonable amount of time to do that) and move it to an archive directory once you are done processing so there is no confusion.
The next time the task runs, it should look only for files that do not match the naming pattern (as already suggested in one of the comments).
You might also want to add an additional task that checks for imports that may have failed (e.g. a file whose name includes the exact start time but which is still not archived after a whole day) and either sends some kind of notification, or triggers the task again / renames the file so it is picked up again (depending on how well your rollback works).
By default, Hadoop map tasks write processed records to files in a temporary directory at ${mapred.output.dir}/_temporary/_${taskid}. These files sit there until the FileCommitter moves them to ${mapred.output.dir} (after the task finishes successfully). I have a case where, in the setup() of a map task, I need to create files under the temporary directory mentioned above, in which I write some process-related data that is used later elsewhere. However, when Hadoop tasks are killed, the temporary directory is removed from HDFS.
Does anyone know if it is possible to tell Hadoop not to delete this directory after a task is killed, and how to achieve that? I guess there is some property I can configure.
Regards
It's not a good practice to depend on temporary files, whose location and format can change anytime between releases.
Anyway, setting mapreduce.task.files.preserve.failedtasks to true will keep the temporary files of all failed tasks. Setting mapreduce.task.files.preserve.filepattern to a regex matching the task ID will keep the temporary files of matching tasks regardless of success or failure.
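For reference, a sketch of how those properties could be set in mapred-site.xml (or passed with -D on the job command line); the task-ID pattern shown is made up:

<property>
  <name>mapreduce.task.files.preserve.failedtasks</name>
  <value>true</value>
</property>
<property>
  <!-- keep files for every task whose attempt ID matches this (hypothetical) regex -->
  <name>mapreduce.task.files.preserve.filepattern</name>
  <value>.*_m_000001_.*</value>
</property>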