For Spring Quartz in cluster mode, if the scheduler is put on standby for one instance, will it keep all instances on standby too?

We are using a Quartz scheduler for scheduling a few jobs.
We run on multiple VMs, so we have configured Quartz in clustered mode using a MySQL job store.
Now we are migrating to Kubernetes. There will be multiple main pods that handle all the incoming requests, plus worker pods that should only run these schedulers. The problem is that we don't want the main pods to execute the Quartz jobs.
I was thinking of disabling the scheduler on the main pods and enabling it only on the worker pods; that way we can achieve what is required.
So the question is: if we disable the scheduler or put it on standby on the main pods, will that also put the schedulers of the other (worker) instances on standby, since all of them are running in cluster mode?
If anyone has a better solution, please help. I don't have any solution in mind, not even a workaround. If we don't find one, we will have to stick with our older VMs and cannot migrate to Kubernetes.
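One way to realize the "disable the scheduler on the main pods" idea in Spring Boot is a per-pod toggle on the scheduler's auto-startup. In Quartz clustering each node runs its own scheduler instance against the shared job store, so a scheduler that is never started on a given pod simply stops competing for triggers there, while the other nodes keep firing jobs. A minimal sketch, assuming Spring Boot's Quartz auto-configuration and a hypothetical QUARTZ_WORKER environment variable set only on the worker pods:

    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.boot.autoconfigure.quartz.SchedulerFactoryBeanCustomizer;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class QuartzPodRoleConfig {

        // Hypothetical flag: set QUARTZ_WORKER=true only on the worker pods
        // (e.g. in the pod spec); leave it unset/false on the main pods.
        @Value("${QUARTZ_WORKER:false}")
        private boolean workerPod;

        @Bean
        public SchedulerFactoryBeanCustomizer quartzPodRoleCustomizer() {
            // With autoStartup=false this pod's scheduler never starts, so it
            // never polls the clustered MySQL job store and never picks up jobs;
            // the worker pods start their schedulers normally and share the load.
            return factory -> factory.setAutoStartup(workerPod);
        }
    }

If you are on the Spring Boot Quartz starter, setting spring.quartz.auto-startup=false on the main pods should have the same effect; either way the worker pods still coordinate job execution among themselves through the clustered job store.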

Related

Spring Boot AWS cluster instance scheduler

I have a Spring Boot application which takes requests from users and saves data in a DB.
Certain integration calls need to be made with the saved data, so I thought of a scheduled task that runs every 15 minutes, picks up this data, and makes the necessary calls.
But my application is deployed on AWS EC2 on 2 instances, so this scheduled process will run on both instances, which will cause duplicate integration calls.
Any suggestions on how to avoid the duplicate calls?
I don't have any code to share as of now.
Please share your thoughts... Thanks.
It seems a similar question was answered here: Spring Scheduled Task running in clustered environment
My take:
1) Easy - you can move the scheduled process to a separate instance from the ones that service request traffic, and only run it on one instance, a "job server" if you will.
2) Most scalable - have the scheduled task on two instances but they will somehow have to synchronize who is active and who is standby (perhaps with a cache such as AWS Elasticache). Or you can switch over to using Quartz job scheduler with JDBCJobStore persistence, and it can coordinate which of the 2 instances gets to run the job. http://www.quartz-scheduler.org/documentation/quartz-2.x/tutorials/tutorial-lesson-09.html
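For the Quartz option in point 2, the coordination comes from running both instances against the same database with clustering enabled. A minimal sketch of such a setup, built programmatically; the scheduler name, data source name, and the MySQL connection details are placeholders, not something prescribed by the answer:

    import java.util.Properties;
    import org.quartz.Scheduler;
    import org.quartz.impl.StdSchedulerFactory;

    public class ClusteredQuartzBootstrap {

        public static Scheduler buildScheduler() throws Exception {
            Properties props = new Properties();
            // One logical scheduler shared by both EC2 instances; each JVM gets
            // its own auto-generated instance id.
            props.setProperty("org.quartz.scheduler.instanceName", "integrationScheduler");
            props.setProperty("org.quartz.scheduler.instanceId", "AUTO");
            // JDBCJobStore with clustering: instances take a row lock in the shared
            // database, so each trigger firing is executed by exactly one node.
            props.setProperty("org.quartz.jobStore.class",
                    "org.quartz.impl.jdbcjobstore.JobStoreTX");
            props.setProperty("org.quartz.jobStore.driverDelegateClass",
                    "org.quartz.impl.jdbcjobstore.StdJDBCDelegate");
            props.setProperty("org.quartz.jobStore.isClustered", "true");
            props.setProperty("org.quartz.jobStore.dataSource", "quartzDS");
            // Placeholder connection details; point this at the database both instances share.
            props.setProperty("org.quartz.dataSource.quartzDS.driver", "com.mysql.cj.jdbc.Driver");
            props.setProperty("org.quartz.dataSource.quartzDS.URL", "jdbc:mysql://db-host:3306/quartz");
            props.setProperty("org.quartz.dataSource.quartzDS.user", "quartz");
            props.setProperty("org.quartz.dataSource.quartzDS.password", "change-me");
            props.setProperty("org.quartz.threadPool.class", "org.quartz.simpl.SimpleThreadPool");
            props.setProperty("org.quartz.threadPool.threadCount", "5");

            Scheduler scheduler = new StdSchedulerFactory(props).getScheduler();
            scheduler.start();
            return scheduler;
        }
    }

Whichever instance's scheduler acquires the lock for a trigger runs that firing, so the 15-minute job executes once per interval rather than once per instance.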

Apache Aurora cron jobs are not scheduled

I set up a Mesos cluster which runs the Apache Aurora framework, and I registered 100 cron jobs which run every minute on a pool of 5 slave machines. I found that after being scheduled about 100 times, the cron jobs get stuck in the "PENDING" state. May I ask what kind of logs I can inspect and what the possible problem is?
It could be a couple of things:
Do you still have sufficient resources in your cluster?
Are those resources offered to Aurora? Or maybe only to another framework?
Do you have any task constraints that prevent your tasks from being scheduled?
Possible information source:
What does the tooltip or the expanded status say in the UI?
The Aurora scheduler has log files. However, those are normally not needed for an end user to figure out why jobs are stuck in PENDING.
In case you are stuck here, it would probably be best to drop by the #aurora IRC channel on freenode.

Running multiple instances of a Spark app on Mesos through Marathon

I am trying to run a Spark Streaming app through Marathon on Mesos; the job eventually stores some counts in a Cassandra instance. My question is: should I set the number of instances (in Marathon) for this app to 2 (for HA)? The issue is, wouldn't the 2nd instance just be a replica of the first, so that processing and results would be duplicated?
No, you don't set the number of instances to 2 for HA. Marathon will restart any app that has gone down for whatever reason. It is good practice to implement health checks, though.
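To make that restart behaviour useful for the streaming driver, the app definition posted to Marathon can carry a health check. A rough sketch below; the Marathon host, the app id, the spark-submit command, and the /health endpoint are all placeholder assumptions:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SubmitSparkStreamingApp {

        public static void main(String[] args) throws Exception {
            // Single instance plus an HTTP health check: Marathon restarts the app if it
            // exits or keeps failing the check, so HA does not require instances = 2.
            String appJson = """
                {
                  "id": "/spark-streaming-counts",
                  "cmd": "/opt/spark/bin/spark-submit --master mesos://zk://zk:2181/mesos app.jar",
                  "cpus": 1.0,
                  "mem": 2048,
                  "instances": 1,
                  "healthChecks": [{
                    "protocol": "HTTP",
                    "path": "/health",
                    "portIndex": 0,
                    "gracePeriodSeconds": 300,
                    "intervalSeconds": 60,
                    "maxConsecutiveFailures": 3
                  }]
                }
                """;

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://marathon.example.com:8080/v2/apps"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(appJson))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }

The HTTP check only makes sense if the driver actually exposes an endpoint on its first assigned port; for a driver without one, Marathon's TCP or COMMAND health checks are the usual alternatives.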

How to recover Mesos executor after Mesos framework failure?

My scenario is that a framework is running on server A. It has an executor on server B running a task (a long-running web service with a long initialization time). Server A is shut down. The framework is then restarted somewhere else in the cluster.
Currently, after the restart the new framework registers a new executor which runs a new task. After some time, the Mesos master deactivates the old, no-longer-running framework, which in turn kills the old but still-running executor and its task.
I would like the new framework to re-register the old executor rather than register a new one. Is this possible?
This post on the Mesos user mailing list answers my question:
http://www.mail-archive.com/user%40mesos.apache.org/msg00069.html
Included here for reference:
(1) One thing in particular I found unexpected is that the executors are shut down if the scheduler is shut down. Is there a way to keep executors/tasks running when the scheduler is down? I would imagine that when the scheduler comes back, it could reestablish the state somehow and keep going without interrupting the running tasks. Is this a use case that Mesos is designed for?
You can use FrameworkInfo.failover_timeout to tell Mesos how long to wait for the framework to re-register before it cleans up the framework's executors and tasks.
Also, note that for this to work the framework has to persist its frameworkId when it first registers with the master. When the framework comes back up, it needs to reconnect by setting FrameworkInfo.framework_id = persisted id.
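Translated into the libmesos Java bindings, those two points look roughly like the sketch below; the framework name, the one-week timeout, and how the id gets persisted are illustrative assumptions, not part of the mailing-list answer:

    import org.apache.mesos.MesosSchedulerDriver;
    import org.apache.mesos.Protos.FrameworkID;
    import org.apache.mesos.Protos.FrameworkInfo;
    import org.apache.mesos.Scheduler;

    public final class FrameworkBootstrap {

        // persistedId is the FrameworkID value saved (e.g. in ZooKeeper or a DB)
        // when the framework first registered; pass null on the very first start.
        public static MesosSchedulerDriver buildDriver(Scheduler scheduler,
                                                       String master,
                                                       String persistedId) {
            FrameworkInfo.Builder framework = FrameworkInfo.newBuilder()
                    .setUser("")                       // empty => Mesos fills in the current user
                    .setName("long-running-web-service")
                    // Keep executors/tasks alive for up to one week while the
                    // framework is disconnected, instead of tearing them down.
                    .setFailoverTimeout(7 * 24 * 3600);

            if (persistedId != null) {
                // Re-register as the same framework so the old executor is re-attached.
                framework.setId(FrameworkID.newBuilder().setValue(persistedId).build());
            }

            return new MesosSchedulerDriver(scheduler, framework.build(), master);
        }
    }

On the first start, the framework id assigned by the master arrives in the scheduler's registered() callback, and that is the value you would persist for the next restart.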

What keeps the cluster resource manager running?

I would like to use Apache Marathon to manage resources in a clustered product. Mesos and Marathon solve some of the "cluster resource manager" problems for additional components that need to be kept running with HA, failover, etc.
However, there are a number of services that need to be kept running to keep Mesos and Marathon themselves running (like ZooKeeper, Mesos itself, etc.). What can we use to keep those services running with HA, failover, etc.?
It seems like solving this across a cluster (managing how many instances of ZooKeeper, etc. run, where they run, and how they fail over) is exactly the problem that Mesos/Marathon are trying to solve.
As the Mesos HA doc explains, you can start multiple Mesos masters and let ZK elect the leader. Then if your leading master fails, you still have at least 2 left to handle things. It is common to use something like systemd to automatically restart the mesos-master on the same host if it's still healthy, or something like Amazon AutoScalingGroups to ensure you always have 3 master machines even if a host dies.
The same can be done for Marathon in its HA mode (on by default if you start multiple instances pointing to the same znode). Many users start these on the same 3 nodes as their Mesos masters, using systemd to restart failed Marathon services, and the same ASG to ensure there are 3 Mesos/Marathon master nodes.
These same 3 nodes are often configured to be the ZK quorum as well, so there are only 3 nodes you have to manage for all these services running outside of Mesos.
Conceivably, you could bootstrap both Mesos-master and Marathon into the cluster as Marathon/Mesos tasks. Spin up a single Mesos+Marathon master to get the cluster started, then create a Mesos-master app in Marathon to launch 2-3 masters as Mesos tasks, and a Marathon-master app in Marathon to launch a couple of HA Marathon instances (as Mesos tasks). Once those are healthy, you can kill the original standalone Mesos/Marathon master and the cluster would failover to the self-hosted Mesos and Marathon masters, which would be automatically restarted elsewhere on the cluster if they failed. Maybe this would work with ZK too. You'd probably need something like Mesos-DNS and/or ELB to let other services find Mesos/Marathon. I doubt anybody's running Mesos this way, but it's crazy enough it just might work!
In order to understand this, I suggest you spend a few minutes reading up on the architecture and the HA part in the official Mesos doc. There, it is clearly explained how HA/failover in Mesos core is handled (which is, BTW, nothing magic—many systems I know of use pretty much exactly this model, incl. HBase, Storm, Kafka, etc.).
Also, note that (naturally) the challenge of keeping a handful of Mesos masters/ZK nodes alive is not directly comparable with keeping potentially tens of thousands of processes across a cluster alive, evicting them, or failing them over (in terms of fan-out, memory footprint, throughput, etc.).