I am running 1000 jobs on a cluster, using an sbatch job array. I've set up my code such that if the job array index is set to 0, precomputations are executed and saved to file; jobs 1-999 then access these precomputations.
The precomputations in job 0 take much less time than any of jobs 1-999.
Instead of manually submitting job 0 with some time limit T1, waiting for it to finish, and then submitting jobs 1-999 as a job array with time limit T2, I would like to submit everything from the same .sh file and:
force jobs 1-999 to start only when job 0 is done
enforce that job 0 has time limit T1 and jobs 1-999 have time limit T2.
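Roughly, I am imagining a single submission script like the sketch below, where run.sh branches on $SLURM_ARRAY_TASK_ID and the --time values stand in for T1 and T2 (untested):

#!/bin/bash
# Submit index 0 (the precomputation) on its own with the short limit T1
precompute_id=$(sbatch --parsable --array=0 --time=00:30:00 run.sh)
# Submit indices 1-999 with the longer limit T2, held until job 0 finishes successfully
sbatch --dependency=afterok:${precompute_id} --array=1-999 --time=12:00:00 run.sh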
Is something along these lines possible?
I have a case where I want a job to run from Day 1 to Day 5 of a month, and only when its predecessor has succeeded. But what's happening is that the job runs only once a day after its start time, even when the predecessor runs multiple times through the day (I want my job to refresh as many times as the predecessor does).
Job A
date_conditions: 1
run_calendar: D1_D5
start_times: "00:00"
condition: s(Job B)
I tried removing start_times, but with the calendar it takes a default start time of "00:00" anyway and the job gets scheduled for the next day after the first run.
Thanks in advance.
If I understand you correctly, you want Job A to run only on Days 1-5, but every time Job B runs on those days.
You could update your script logic for Job A so that it only truly runs on Day 1 to Day 5 (the rest of the month would be some kind of dummy script), then change the dependency so it's just dependent on Job B.
If you want to use the scheduler, I recommend cloning Job B and scheduling the clone to run on Days 1-5. Then change Job A to be dependent on the Job B clone.
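In JIL terms, the idea would look roughly like this (the clone's name is a placeholder and its command should mirror Job B's; sketch only):

insert_job: Job_B_clone
job_type: CMD
command: <same command as Job B>
date_conditions: 1
run_calendar: D1_D5

update_job: Job_A
condition: s(Job_B_clone)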
I want to use a queued job to update data in the database, processing records in order of creation time.
Example: I have a table job_history.
I insert multiple records into the job_history table with status = 0 (not finished).
I want to handle all records with status = 0 via a queue, in ascending order of creation time.
That is, after record 1 has been processed (the record is updated to status = 1, finished, or times out after 60s),
the queue should automatically move on to records 2, 3, ... until the end (updating each to status = 1).
And when I insert a new record into the job_history table, the queue should keep listening and continue handling records with status = 0.
I could handle this with a cron job, but cron can be configured to run at most once a minute, which interrupts the work;
I want record 2 to be handled as soon as record 1 finishes, not after waiting up to a minute for the cron.
Can a Laravel queue do this,
and how do I set it up?
In production, in order to consume queue jobs, you can use Supervisor.
You can read this doc: https://laravel.com/docs/7.x/queues#supervisor-configuration
queue:work handles one job at a time, sequentially, as long as there are jobs in the queue. You don't have to do anything special to achieve what you're asking for.
Also, you don't strictly need Supervisor to run queues; they can be handled with a cron and a relatively simple script that checks whether the worker is running and launches or restarts it as needed.
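As a rough illustration (the paths are placeholders), the cron-launched check could be as simple as:

#!/bin/bash
# Hypothetical watchdog, run from cron: start the worker only if it is not already running.
if ! pgrep -f "artisan queue:work" > /dev/null; then
  cd /path/to/project && nohup php artisan queue:work >> storage/logs/worker.log 2>&1 &
fi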
Yes, you can handle this situation by executing this command:
php artisan queue:work --stop-when-empty
With cron, it will check every minute whether the queue is empty; if not, it will process all existing jobs in the queue and then terminate.
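For example, a crontab entry along these lines (the project path is a placeholder):

# Every minute: process whatever is currently in the queue, then exit
* * * * * cd /path/to/project && php artisan queue:work --stop-when-empty > /dev/null 2>&1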
I have an MR job which runs fine on the cluster.
After the job completes, I'm able to get the YARN logs, but I couldn't find the MR job counters, like the number of input records and output records.
Is it possible to get that information after job completion?
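For reference, I was hoping something along the lines of the following would show them after the fact (the job id below is made up, and I assume the JobHistory server still knows about the job):

mapred job -status job_1588888888888_0001
mapred job -counter job_1588888888888_0001 org.apache.hadoop.mapreduce.TaskCounter MAP_INPUT_RECORDS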
I want to trigger a Spark job based on the size of an HDFS folder: for instance, if a folder reaches 100 MB, then a Spark job should be triggered.
I read about the dirSize HDFS EL function in Oozie, but I'm not sure how to use it. Does it trigger the job when the folder reaches 100 MB, or does it have to be checked periodically, say every 2 minutes?
One option is to run an Oozie coordinator periodically (say every 2 minutes) to check the folder size; if it reaches the specified limit, you can trigger the Spark job.
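If it helps to see the idea outside of Oozie, the same periodic check can be sketched in plain shell (the path, threshold, and spark-submit arguments below are placeholders), e.g. run from cron every 2 minutes:

#!/bin/bash
# Hypothetical size check: trigger the Spark job once the folder reaches ~100 MB.
THRESHOLD=$((100 * 1024 * 1024))   # 100 MB in bytes
SIZE=$(hdfs dfs -du -s /data/incoming | awk '{print $1}')
if [ "$SIZE" -ge "$THRESHOLD" ]; then
  spark-submit --class com.example.MyJob /path/to/my-job.jar /data/incoming
fi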
I am using Hadoop to process big data. I first load data into HDFS and then execute jobs, but this is sequential. Is it possible to do it in parallel? For example,
running 3 jobs and 2 data-load processes for other jobs at the same time on my cluster.
Cheers
It is possible to run all the jobs in parallel in Hadoop if your cluster and jobs satisfy the criteria below:
1) The Hadoop cluster should be able to run a reasonable number of map/reduce tasks (depending on the jobs) in parallel, i.e. it should have enough map/reduce slots.
2) If a job that is currently running depends on data that is being loaded through another process, that data load and that job cannot run in parallel.
If your processes satisfy the above conditions, you can run all the jobs in parallel.
Using Oozie you can schedule all the processes to run in parallel; the Fork and Join nodes in an Oozie workflow allow you to run tasks in parallel.
If your cluster has enough resources to run the jobs in parallel, then yes. But be sure that the work of each job doesn't interfere with the others; for example, loading data at the same time that another running job is supposed to be using it won't work as you expect.
If there are not enough resources, Hadoop will queue the jobs until resources are available, depending on the scheduler configured.
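As a simple illustration of the client side (the jar names, classes, and paths below are made up), independent jobs and data loads can simply be launched concurrently, and the scheduler will fit them into whatever capacity the cluster has:

#!/bin/bash
# Hypothetical example: two independent MR jobs and one data load submitted at once.
hadoop jar analytics.jar com.example.JobOne /data/in1 /data/out1 &
hadoop jar analytics.jar com.example.JobTwo /data/in2 /data/out2 &
hdfs dfs -put /local/batch3 /data/in3 &   # load for a later, unrelated job
wait   # block until all three have finished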