Capistrano intervals between restarts of the servers to ensure service continuity - passenger

I know the best practice for pushing changes to production is to have two sets of servers, A and B: A serves the website for the client, the update is pushed to B, and then A and B are switched to ensure continuity of service. But this feels kind of hard to implement with Capistrano (?)
I currently have a pool of autoscaled servers on the Amazon cloud. Using Capistrano, my deploy command deploys the update on all the servers and restarts them all at the same time. While Passenger restarts itself, there is downtime on my production servers (and restarting can take up to 10 seconds, so it's a problem).
To avoid this, I'd like to restart my servers one at a time and wait x seconds before restarting the next server. (I don't mind having 2 different versions of the code online; the target scenario I have in mind is deploying a small hotfix.)
Is there a way to override the Capistrano restart task so that it waits some time before running the command on the next server?

This is built in:
on roles(:all), in: :sequence, wait: 15 do
  # Your restart task
end

I am actually using Capistrano-Passenger to restart my servers, and I just noticed there's a set :passenger_restart_wait, 5 setting which seems to do that already!
From the gem readme


If an AWS spot instance is stopped by AWS and then restarts, will it just start where it left off?

I am running Luigi, a pipeline manager, which is processing 1000 tasks. Currently I poll for the AWS termination notice. If it is present, I requeue the job, wait 30 minutes, and then launch a new server that starts all the tasks from scratch. However, sometimes it restarts the same job multiple times, which is inefficient.
Instead I am considering using create_fleet with InstanceInterruptionBehavior=stop. If I do this, then when the instance restarts will it still be running the Luigi daemon and retain the state of all the tasks?
All InstanceInterruptionBehavior=stop does is shut down your EC2 instance rather than terminate it. Since this behavior requires the "persistent" setting in addition to EBS storage, you will keep all the data that was on the attached EBS volumes at the time the instance stopped. Running processes and anything held only in memory are not preserved; the instance boots afresh when it starts again.
It is completely dependent on the application itself (Luigi in this case) to be able to store the state of its execution and pick back up from where it left off. For one, you'll want to ensure you enable the service daemon to automatically start upon system start (example):
sudo systemctl enable yourservice
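To illustrate what "pick back up from where it left off" can look like, here is a minimal sketch using only the Python standard library. It is not Luigi's API: run_pending, the marker directory and the .done file naming are hypothetical names chosen for this example. The idea is that each task writes a completion marker to disk (which would live on the EBS volume), and a restarted run skips every task whose marker already exists. Luigi's default completeness check works along similar lines, since a task counts as complete once its output targets exist.

```python
import tempfile
from pathlib import Path


def run_pending(task_ids, done_dir, work):
    """Run each task that has no completion marker yet; return the ids actually run.

    done_dir is assumed to be on storage that survives an instance stop
    (for example, a directory on the attached EBS volume).
    """
    done_dir = Path(done_dir)
    done_dir.mkdir(parents=True, exist_ok=True)
    ran = []
    for task_id in task_ids:
        marker = done_dir / f"{task_id}.done"
        if marker.exists():
            continue  # already finished before the interruption
        work(task_id)
        # Write the marker only after the work returned without raising.
        marker.write_text("ok")
        ran.append(task_id)
    return ran


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # First boot: three tasks run and leave markers behind.
        print(run_pending(range(3), tmp, lambda task_id: None))
        # After a restart: only the tasks without markers run.
        print(run_pending(range(5), tmp, lambda task_id: None))
```

With this pattern, the service enabled above can simply rerun the whole pipeline on boot, and finished work is skipped instead of being repeated.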

TeamCity with AWS cloudformation stuck on AgentService

I followed TeamCity's description of running a TeamCity build server on AWS with a CloudFormation template. I launched it, and it gets stuck at AgentService (Resource creation initiated). I waited for half an hour with no progress.
Resources tab shows the following:
What am I doing wrong here?
(For me) this typically happens when the service cannot be started, for instance because the cluster does not have enough suitable instances to run your service.
To diagnose it, open your service in the ECS cluster and check its events. Then, in the service's tasks, look at the stopped tasks and the reasons they were stopped.
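The same checks can be scripted. The following is a rough sketch, assuming a boto3 ECS client is passed in as ecs; the function name ecs_diagnostics and the cluster and service names in the usage comment are hypothetical. It only calls describe_services, list_tasks and describe_tasks, and returns the service event messages together with the stop reason of each stopped task.

```python
def ecs_diagnostics(ecs, cluster, service, max_events=10):
    """Collect service event messages and the stop reasons of stopped tasks.

    ecs is expected to behave like boto3.client("ecs").
    """
    described = ecs.describe_services(cluster=cluster, services=[service])
    events = [
        event["message"]
        for svc in described.get("services", [])
        for event in svc.get("events", [])[:max_events]
    ]

    task_arns = ecs.list_tasks(
        cluster=cluster, serviceName=service, desiredStatus="STOPPED"
    ).get("taskArns", [])

    stopped = []
    if task_arns:  # describe_tasks needs at least one task
        tasks = ecs.describe_tasks(cluster=cluster, tasks=task_arns).get("tasks", [])
        for task in tasks:
            stopped.append((task["taskArn"], task.get("stoppedReason", "")))
    return events, stopped


# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   events, stopped = ecs_diagnostics(boto3.client("ecs"), "my-cluster", "my-service")
```

Event messages such as a failure to place a task, or a stop reason on a task, usually point at why the CloudFormation resource never finishes creating.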
Got a tip from a colleague that if you are creating a CF template based service, it may take up to 3(!) hours. Tried again today, after 3 hours it was up and running.
The reason for this is the setup of the ECS, which involves DNS setup for an internet facing service.

Laravel Horizon inactive and still processing

I run my Application on Kubernetes.
I have one Service for requests and one service for the worker processes.
When I access the Horizon UI, it often shows the Inactive status, but the worker is still processing jobs. I know this because the JOBS PAST HOUR count keeps increasing.
If I scale up my worker service, there are constantly "failing" jobs with the exception Illuminate\Queue\MaxAttemptsExceededException.
If I connect directly to the pods and run ps aux I will see that there are horizon instances running.
If I connect to a pod on which the worker is running and execute the horizon:list command it tells me that one (or multiple) Masters are running.
How can I further debug this?
Laravel version: 5.7.15
Horizon version: 2.0.0
Redis version: 3.2.4
The issue was that the server time was out of sync, so the "old" ones got restarted all the time.

DC/OS (Mesos/Marathon): how to set the time before a killed application instance is restarted

I have installed DC/OS (3 master and 7 slave servers, all CentOS 7).
I noticed a problem: when one of the slave servers shuts down, Mesos/Marathon restarts the killed application instances only after 5 minutes.
For example, I run 8 instances of a simple web application in Mesos/Marathon. When I shut down one slave server or deactivate its network interface, Marathon shows that some instances are killed. From that moment, Mesos/Marathon waits 5 minutes and then starts the killed instances on another online slave server.
My question is: how can I change this time? 5 minutes is too long. I read the DC/OS documentation but I can't find the variable responsible for this.
I will be very thankful for your help.
You can have a look at the Marathon command-line flags. Based on your description, I guess the default for either task_launch_timeout or scale_apps_interval could be responsible for this.
I'm unsure though if this can be configured on the fly, or during installation in DC/OS. I saw that there's a quite recent enhancement request to Make Marathon flags passable via environment variables.

Celery, Resque, or custom solution for processing jobs on machines in my cloud?

My company has thousands of server instances running application code - some instances run databases, others are serving web apps, still others run APIs or Hadoop jobs. All servers run Linux.
In this cloud, developers typically want to do one of two things to an instance:
Upgrade the version of the application running on that instance. Typically this involves a) tagging the code in the relevant subversion repository, b) building an RPM from that tag, and c) installing that RPM on the relevant application server. Note that this operation would touch four instances: the SVN server, the build host (where the build occurs), the YUM host (where the RPM is stored), and the instance running the application.
Today, a rollout of a new application version might be to 500 instances.
Run an arbitrary script on the instance. The script can be written in any language provided the interpreter exists on that instance. E.g. The UI developer wants to run his "check_memory.php" script which does x, y, z on the 10 UI instances and then restarts the webserver if some conditions are met.
What tools should I look at to help build this system? I've seen Celery and Resque and delayed_job, but they seem like they're built for moving through a lot of tasks. This system is under much less load - maybe on a big day a thousand hundred upgrade jobs might run, and a couple hundred executions of arbitrary scripts. Also, they don't support tasks written in any language.
How should the central "job processor" communicate with the instances? SSH, message queues (which one), something else?
Thank you for your help.
NOTE: this cloud is proprietary, so EC2 tools are not an option.
I can think of two approaches:
Set up password-less SSH on the servers, keep a file that contains the list of all machines in the cluster, and run your scripts directly over SSH. For example: ssh <host> "ls -la". This is the same approach used by Hadoop's cluster startup and shutdown scripts. If you want to assign tasks dynamically, you can pick nodes at random.
Use something like Torque or Sun Grid Engine to manage your cluster.
The package installation can be wrapped inside a script, so you just need to solve the second problem, and use that solution to solve the first one :)
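The first approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: password-less SSH already works, the hosts file has one host per line, and the helper names read_hosts, ssh_command and run_everywhere are made up for this example. The remote command is an opaque string, so the script it launches can be written in any language installed on the instance.

```python
import subprocess


def read_hosts(text):
    """Parse a hosts file: one host per line; blank lines and # comments are ignored."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            hosts.append(line)
    return hosts


def ssh_command(host, remote_command):
    """Build the argument list for running remote_command on host."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, remote_command]


def run_everywhere(hosts, remote_command, runner=subprocess.run):
    """Run remote_command on each host in turn; return {host: exit code}."""
    results = {}
    for host in hosts:
        completed = runner(
            ssh_command(host, remote_command), capture_output=True, text=True
        )
        results[host] = completed.returncode
    return results


# Usage (hosts.txt and the script path are placeholders):
#   with open("hosts.txt") as f:
#       print(run_everywhere(read_hosts(f.read()), "php check_memory.php"))
```

Running hosts one after another keeps the sketch simple; for a 500-instance rollout you would likely want to run batches in parallel and collect failures for retry.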
