We have a lot of batch jobs for our service, all executed from one machine, which is now running out of CPU resources.
Most of these jobs are pretty simple. For example, once every 5 minutes, query our database for data that needs to be processed, process it, and write the results back to the database.
The catch is that each of these jobs may only run as a single instance at a time: if two instances run simultaneously, we get race conditions and duplicate results.
Is Apache Mesos the right solution for us to replace the job server? That is, can we create a lot of small frameworks, each one a scheduled job, and have Chronos trigger each of them at a time interval? Can we guarantee that when a job is triggered, only one instance is running?
Yes, Chronos on Mesos would do the trick. To be precise, you don't "create a lot of small frameworks" but rather jobs in Chronos; Chronos is itself one of the Mesos frameworks, and there are many others, such as Marathon or Cook.
The pattern would be as follows (for a containerized job), say in a file called test.json:
{
  "name": "test",
  "cpus": 0.1,
  "mem": 100,
  "shell": true,
  "command": "echo I AM DOING SOME SERIOUS WORK",
  "async": false,
  "container": {
    "type": "DOCKER",
    "image": "ubuntu:14.04"
  },
  "schedule": "R/2016-05-23T17:00:00Z/PT1M",
  "owner": "michael.hausenblas@dcos.io"
}
Now, first you'd register the Chronos job (I'm using HTTPie's http command here, but it also works with, say, curl):
$ http POST http://$CHRONOS_NODE:8080/service/chronos/scheduler/iso8601 < test.json
The above job would execute every minute, starting at 17:00 UTC on 2016-05-23 (the ISO 8601 schedule R/2016-05-23T17:00:00Z/PT1M reads: repeat forever, from that start time, with a one-minute period). If you want to, you can also trigger it manually like so:
$ http PUT http://$CHRONOS_NODE:8080/service/chronos/scheduler/job/test
Last but not least, Chronos constraints allow you to influence placement.
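For example, to pin a job onto a particular agent you could add a constraints field to the job JSON above; a minimal sketch (the hostname value is illustrative, not from the original job):

"constraints": [
  ["hostname", "EQUALS", "node-1.example.com"]
]

Besides EQUALS, Chronos also supports LIKE and UNLIKE operators for pattern-based placement.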
You may want to look into using workflow management / job orchestration tools like Luigi or Airflow. You can define dependencies and have jobs only run after the jobs they depend on have completed.
I'm looking for an analogue of the nomad stop <job> command which I can schedule from the job itself. Is there such a thing in Nomad? I failed to find anything that can do it. It looks like Nomad fundamentally only starts jobs; stopping them is not something that can be specified in the job definition.
The only idea I have is to have two jobs: one to start, and another to stop. The "stop" job would use the command line to stop the other one. Is that the right way? I'm wondering how everyone else is doing this.
You can always just have your job exit with code 0; jobs don't need to run forever. Nomad will then mark the job as "complete" and you're done.
If the job itself can't control its own exit, then yes, you might need to set up some external job that makes API calls to stop the task, but I'd recommend trying to just have the job exit when it has completed its work.
Nomad has different schedulers for different types of jobs.
For your requirement, you can use a Nomad batch job.
The default job type is service, so if you don't specify a job type it will run as a service job, which is expected not to exit unless stopped explicitly. You can set the job type in the Nomad job file as follows:
job "sample" {
type = "batch"
# ...
}
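Putting the pieces together, here is a minimal sketch of a scheduled batch job that simply exits when its work is done (the job name and the cleanup.sh script are hypothetical; prohibit_overlap is a periodic option that prevents two runs from overlapping):

job "cleanup" {
  type = "batch"

  periodic {
    cron             = "*/5 * * * *"  # run every five minutes
    prohibit_overlap = true           # never start a run while one is still active
  }

  group "main" {
    task "run" {
      driver = "exec"
      config {
        command = "/usr/local/bin/cleanup.sh"  # hypothetical script; exits 0 when finished
      }
    }
  }
}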
According to the business logic of my Spring Boot application with Quartz Scheduling and MongoDB as the job persistence store, every user of the system can create a postponed job that must be executed at some point in time. The user chooses the time at which it must be executed.
Right now I'm thinking about the approach where every user will create a dedicated JobDetail for every postponed job, something like this:
schedulerFactoryBean.getScheduler().addJob(jobDetail(), true, true);
The issue I can potentially see here is that with this approach I could quickly create thousands of jobs in the Quartz scheduler. I have never scheduled that many jobs in Spring Scheduling with Quartz before and don't know how the system will handle it. Is it a good idea to implement the system this way, and will Spring Scheduling with Quartz handle that many jobs without problems?
Yes, Quartz itself can handle thousands of jobs and triggers without any issues.
If you are going to have many jobs executing concurrently, just make sure that you configure Quartz with a sufficient number of worker threads. The number of worker threads should be typically equal to the maximum number of jobs that can be running concurrently + some small buffer (10% or so) just in case.
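With a plain quartz.properties this sizing could look as follows (the thread count of 50 is purely illustrative; pick it based on your expected concurrency):

org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount = 50

In a Spring Boot application the same settings can be passed through to Quartz via the spring.quartz.properties.* configuration keys.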
From what you write I assume that your jobs will be one-off jobs, i.e. each job will be executed only once. If that is the case, Quartz can automatically discard your jobs as soon as they finish executing, unless they are marked as durable: Quartz automatically removes non-durable jobs that are not scheduled to run again in the future. This feature may help you reduce the total number of registered jobs.
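As a minimal sketch of such a one-off, non-durable job (ReportJob, the job key, and the execution time are illustrative, not from your code):

import java.util.Date;
import org.quartz.*;

// Illustrative stand-in for the user-chosen execution time.
Date executeAt = new Date(System.currentTimeMillis() + 2 * 60 * 60 * 1000L);

// Non-durable JobDetail (the default): Quartz discards it automatically
// once its last trigger has fired and no future executions remain.
JobDetail jobDetail = JobBuilder.newJob(ReportJob.class) // ReportJob implements org.quartz.Job
        .withIdentity("report-for-user-42", "postponed")
        .storeDurably(false)
        .build();

// A one-shot trigger that fires once at the chosen time.
Trigger trigger = TriggerBuilder.newTrigger()
        .forJob(jobDetail)
        .startAt(executeAt)
        .build();

schedulerFactoryBean.getScheduler().scheduleJob(jobDetail, trigger);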
I hope this helps. If not, please ask.
I have a sequence of MapReduce jobs that need to be run. I was wondering if there is any advantage to using Oozie for that, instead of having "one big driver" that runs the sequence.
I know that Oozie can be used to run multiple actions of different types, e.g. a Pig script, a shell script, an MR job, but I'm concretely interested in whether I should split my two jobs and run them using Oozie, or have a single jar do that.
Oozie is a scheduler - crude, poorly documented, but a scheduler.
- If you don't need scheduling per se, or if cron on an edge node is sufficient,
- if you want to handle your workflow logic by yourself (e.g. conditional branching, parallel executions with waiting for stragglers, calling generic sub-workflows with ad hoc parameters, e-mail alerts on errors, <insert your pet feature here>) or don't need any fancy logic,
- if you handle your execution logs and state history by yourself, or don't care about history,
... well then, don't use a scheduler.
PS: you also have Luigi (Spotify) and Azkaban (LinkedIn) as alternative Hadoop schedulers.
[edit] An extra point to consider: if your "driver" crashes for whatever reason, you may not have a chance to send an alert; but if it is run from Oozie, the crash will be detected eventually (it may take as much as 30 min in a corner case, e.g. AM job self-destruction due to a YARN RM failover).
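For reference, if you do go the Oozie route, chaining two MapReduce jobs is just a workflow with two actions wired together via their ok transitions; a minimal sketch (names are illustrative and the mapper/reducer properties are elided):

<workflow-app name="two-mr-jobs" xmlns="uri:oozie:workflow:0.4">
    <start to="first-mr"/>
    <action name="first-mr">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- mapper/reducer configuration goes here -->
        </map-reduce>
        <ok to="second-mr"/>
        <error to="fail"/>
    </action>
    <action name="second-mr">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- mapper/reducer configuration goes here -->
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>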
It seems that all jobs are enqueued, and only one will run at a time. How can we run more than one?
Tower is designed to parallelize jobs, but there are a couple of cases where it will not.
If you have your inventory or SCM set up to "update on launch" and there is no cache, or the cache has expired, then any additional jobs will be stuck pending behind the inventory or SCM update. The inventory and SCM will not update until after the currently running job is done.
If you are trying to run multiple jobs against the same host: Tower will not run multiple jobs against the same host at the same time in order to avoid race conditions. (localhost is a possible exception). If you need multiple jobs to run against the same host at the same time then you need to create two inventories and put that host in both inventories, running the two jobs against different inventories. In this situation, Tower does not know that you are running against the same host.
Jobs which share the same Inventory or SCM source can not run at the same time.
Suppose you have a job comprised of three tasks:
task 1: "do x", task 2: "do y", task 3: "do z"
With Ansible, "do x" will run on all the servers, then "do y" will run on all the servers, then "do z" will run on all the servers.
Also, I said "all servers", but in fact it maxes out at the Ansible "forks" value, which defaults to 5. In my 100-server environment I set this value to 20. More on this here: http://docs.ansible.com/intro_configuration.html#forks
Remember, the strength of Ansible is doing a job (a collection of tasks) on many machines at the same time. If what you want is to run the same task many times on a single machine, then you want something like fork or GNU parallel.
In fact, Ansible will try to run "do x" on as many machines at once as it can. You can adjust this behavior, having the whole job run on a portion of the machines before it gets started on more machines, with the "serial" keyword (http://docs.ansible.com/playbooks_delegation.html#rolling-update-batch-size).
Note the subtle difference between forks and serial:
forks is "per task"
serial is "per job" (a collection of tasks)
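To make the distinction concrete, here is a minimal playbook sketch (the host group, commands, and the value 20 are illustrative; forks itself is set globally, e.g. forks = 20 under [defaults] in ansible.cfg):

# Run the whole job on 20 hosts at a time (rolling batches).
- hosts: webservers
  serial: 20
  tasks:
    - name: do x
      command: /usr/local/bin/do-x   # hypothetical command
    - name: do y
      command: /usr/local/bin/do-y
    - name: do z
      command: /usr/local/bin/do-z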
Edit: I re-read your question. This is about running more than one job at a time, not running more than one task in a job. So I think you are correct for ansible-awx, but not for the command line. Via the web interface you can submit a job to the job queue, but you can't make ansible-awx run more than one job at a time, I think. However, via the command line, if you open more than one window you can run multiple ansible-playbooks at the same time. Do you have an Ansible support account? Those guys are great IMHO; they have taken a lot of time to answer my questions (like yours).
Simultaneous jobs can be executed from Tower. Job templates have "Enable Concurrent Jobs" option. See section "15.4. Job Concurrency" at http://docs.ansible.com/ansible-tower/latest/html/userguide/jobs.html.
If you have 3 different tasks running on a single server, that is called synchronous mode: the 3 tasks are assigned to a single job ID, and each task executes one after the other, which consumes a lot of time.
In Ansible versions later than 2.5 we can get 3 job IDs for 3 different tasks and start executing them at the same time, which saves a huge amount of time. This type is called asynchronous mode.
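A minimal sketch of the asynchronous style (the commands are hypothetical; async is the maximum runtime in seconds, and poll: 0 means fire-and-forget):

- hosts: all
  tasks:
    - name: do x
      command: /usr/local/bin/do-x
      async: 600   # allow up to 10 minutes
      poll: 0      # don't wait; start the next task immediately
    - name: do y
      command: /usr/local/bin/do-y
      async: 600
      poll: 0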
I have a pool of Jobs from which I retrieve jobs and start them. The pattern is something like:
Job job = JobPool.getJob();
job.waitForCompletion(true);
JobPool.release(job);
I get a problem when I try to reuse a Job object, in the sense that it doesn't even run (most probably because its status is COMPLETED). So, in the following snippet the second waitForCompletion call just prints the statistics/counters of the job and doesn't do anything else.
Job jobX = JobPool.getJob();
jobX.waitForCompletion(true);
JobPool.release(jobX);
//.......
jobX = JobPool.getJob();
jobX.waitForCompletion(true); // <--- here the job should run, but it doesn't
Am I right in saying that the job doesn't actually run because Hadoop sees its status as COMPLETED and doesn't even try to run it? If yes, do you know how to reset a Job object so that I can reuse it?
The Javadoc includes this hint that jobs should only run once:
The set methods only work until the job is submitted, afterwards they will throw an IllegalStateException.
I think there's some confusion between the job and the view of the job. The latter is the thing you have got, and it is designed to map to at most one job running in Hadoop. The view of the job is fundamentally lightweight, and if creating that object is expensive relative to actually running the job... well, I've got to believe that your jobs are simple enough that you don't need Hadoop.
Using the view to submit a job is potentially expensive (copying jars into the cluster, initializing the job in the JobTracker, and so on); conceptually, the idea of telling the JobTracker to "rerun" a job, or to "copy and run" one, makes sense. As far as I can tell, though, there's no support for either of those ideas in practice. I suspect that Hadoop isn't actually guaranteeing retention policies that would support either use case.
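In practice the usual workaround is to build a fresh Job object (the lightweight view) for every submission instead of pooling them; a minimal sketch, assuming hypothetical MyDriver, MyMapper, and MyReducer classes and illustrative paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// One new view per run; never reuse a Job that has already been submitted.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "my-job");
job.setJarByClass(MyDriver.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
FileInputFormat.addInputPath(job, new Path("/in"));
FileOutputFormat.setOutputPath(job, new Path("/out"));
job.waitForCompletion(true);   // submit and block until the job finishes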