why do cloud build private pools have a long queue time - google-cloud-build

I have set up a private worker pool and I had expected that the queued time for a build would go down. Previously the queue time (when only one build was queued) was about 1 minute, which I had assumed was because I was using shared machines inside GCP to do the build. I therefore expected that a private worker pool would have no queue, since I would be the only one building anything. I was surprised to see it also took about 1 minute. I then thought that perhaps the first build had to spin up a VM and that's why it took so long, so I ran a second build after the first had finished, but that also had a queue time of about 1 minute. I don't understand what is going on; 1 minute is quite a long time.

When you use the Cloud Build shared pool, you use machines provisioned by Google that are already up and running, and paid for by Google. So when you have a build to run, you pick an active machine in the shared pool and run your build on it.
With a private pool, it's different. The machines are still managed by Google, but the pool is private and dedicated to you. Therefore, Google won't keep VMs up and running (and consuming CPU/memory) while you run nothing on them, because you pay only when a job is running. So Google stops the VMs.
When you run a job on Cloud Build, a VM is started and then your job can begin. As with Compute Engine, it takes about 1 minute to provision and start a VM.
That being said, your requirement could be a nice feature request: keep a number of VMs warm to avoid this on-demand provisioning. Of course, it wouldn't be free, but it would be faster!
You can open a feature request here.
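If you want to confirm where the minute is going, the Cloud Build API records both a create time and a start time for every build, so the queue delay can be measured directly. A minimal sketch, assuming the google-cloud-build Python client library; the project ID is a placeholder:

    # Sketch: measure how long recent builds waited in the queue
    # (create_time -> start_time), assuming the google-cloud-build client.
    from google.cloud.devtools import cloudbuild_v1

    client = cloudbuild_v1.CloudBuildClient()

    for build in client.list_builds(project_id="my-project"):  # placeholder project ID
        if build.status == cloudbuild_v1.Build.Status.SUCCESS:
            queued = build.start_time - build.create_time
            print(f"{build.id}: queued for {queued.total_seconds():.0f}s")

On a private pool you should see that delay dominated by VM provisioning rather than by waiting for a free shared worker.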

Related

Heroku: Prevent worker process from restarting?

I have a Heroku worker set up to do a long-running job which iterates over long periods. However, whenever I deploy an update to other files in the repo, this worker restarts, which is annoying. Is there any way to avoid this?
No. This behaviour is part of Heroku's Automatic Dyno Restarting.
You can't work around this. Instead, you need to build all parts of your app to be able to function properly despite the fact that all dynos will restart at least once every 24 hours or so, whether or not you deploy updates in your repo.
Most significantly, you need to build support for Graceful Shutdown into all your processes (e.g. web process and worker processes).
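For a worker process, graceful shutdown concretely means catching SIGTERM, finishing or checkpointing the current unit of work within the grace period (Heroku sends SIGKILL roughly 30 seconds later), and exiting cleanly. A minimal sketch of that pattern in Python; the work and checkpoint steps are placeholders:

    # Minimal sketch of a worker that shuts down gracefully on SIGTERM,
    # which Heroku sends before cycling a dyno (SIGKILL follows ~30s later).
    import signal
    import sys
    import time

    shutdown_requested = False

    def handle_sigterm(signum, frame):
        global shutdown_requested
        shutdown_requested = True  # finish the current iteration, then exit

    signal.signal(signal.SIGTERM, handle_sigterm)

    def do_one_chunk_of_work():
        # placeholder for one resumable slice of the long-running job
        time.sleep(1)

    while not shutdown_requested:
        do_one_chunk_of_work()

    # checkpoint state here so the replacement dyno can resume where this one stopped
    sys.exit(0)

The key design point is that the job is broken into resumable chunks, so a restart costs at most one chunk of progress rather than the whole run.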

Amazon EC2 - Fast AMI Creation in a Production Environment

We run a server architecture with a certain number of base servers which are always on. Our servers process jobs sent to them, and the vast majority of our job requests come in during the workday. To handle this spike in volume, we use EC2 auto-scaling.
I prefer to launch servers through auto-scaling with as much of a configured AMI as possible as opposed to launching from a base AMI and installing packages through long Chef or Puppet scripts.
In our current build process, we implement changes to our code base late at night when only our base servers are needed and no servers are launched through auto-scaling. But every once in a while, we'll have a critical bug fix that needs to be implemented immediately during the day.
We have a rather large EBS volume attached to our servers (approx. 400 GB), and AMI creation of a base server with the latest changes usually takes upwards of one hour. This isn't a problem for late-night deployments when no additional servers need to be launched, but it costs us valuable time during the day because it prevents us from launching additional servers when the latest AMI isn't ready.
Is there anything out there which can speed up the AMI creation process here? I've heard of Netflix's Aminator and Boxfuse, but are there any other alternatives? Also, how do these services stack up against one another?
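For context, the baking step described here boils down to a single CreateImage call against the instance that has the latest code; most of the hour goes into snapshotting the large EBS volume rather than the API call itself. A hedged boto3 sketch of that baseline step, with the region, instance ID and AMI name as placeholders:

    # Sketch of the AMI-baking step discussed above, using boto3.
    # The slow part is the EBS snapshot behind the scenes, not this call.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

    resp = ec2.create_image(
        InstanceId="i-0123456789abcdef0",     # placeholder instance
        Name="app-server-hotfix-ami",         # placeholder name
        NoReboot=True,                        # avoid taking the base server offline
    )

    # Block until the image is usable by the auto-scaling group.
    ec2.get_waiter("image_available").wait(ImageIds=[resp["ImageId"]])
    print("AMI ready:", resp["ImageId"])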

How to deploy a new version of binary to a server when it's already running?

I'm trying to put a new version of my web server (which runs as a binary) on an Amazon EC2 instance. The problem is that I have to shut the process down each time to do so. Does anyone know a workaround where I could upload it while the process is still running?
Even if you could, you don't want to. What you want to do is:
Have at least 2 machines running behind a load balancer
Take one of them out of the LB pool
Shutdown the processes on it
Replace them (binaries, resources, config, whatever)
Bring them back up
Then put it back in the pool.
Do the same for the other machine.
Make sure your changes are backward compatible, as there will be a short period of time when both versions run concurrently.
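Since the question mentions EC2, here is a hedged sketch of those steps for one machine behind a classic Elastic Load Balancer using boto3; the LB name, instance ID, and the commands that actually replace and restart the binary are placeholders (with an ALB you would register/deregister targets in a target group instead):

    # Sketch of the rolling-deploy steps above for one machine behind a classic ELB.
    import time
    import boto3

    elb = boto3.client("elb", region_name="us-east-1")   # assumed region
    LB_NAME = "my-load-balancer"                          # placeholder
    INSTANCE = {"InstanceId": "i-0123456789abcdef0"}      # placeholder

    # 1. Take the instance out of the LB pool so it stops receiving traffic.
    elb.deregister_instances_from_load_balancer(
        LoadBalancerName=LB_NAME, Instances=[INSTANCE]
    )
    time.sleep(30)  # let in-flight requests drain

    # 2-4. Stop the process, replace the binary/config, start it again
    #      (e.g. over SSH or via your config-management tool) -- placeholder.

    # 5. Put the instance back in the pool; the ELB health check must pass
    #    before it receives traffic again.
    elb.register_instances_with_load_balancer(
        LoadBalancerName=LB_NAME, Instances=[INSTANCE]
    )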

Heroku suitable for app based on long running processes?

I have an app which requires long running processes - typically over 2 hours (recording streaming media). Based on Heroku's website, my worker server running these processes will be restarted randomly, at least once per day.
Is there any way to control/avoid these restarts, so as not to interrupt my long-running processes?
Do other PaaS providers avoid this issue?
I don't know how to control/avoid these restarts. I was also going through their documentation, and they clearly state that "Dynos are also cycled at least once per day, in addition to being restarted as needed for the overall health of the system and your app."
I think dyno restarts should only take place when the system behaves unexpectedly, when dynos are found in a crashed state, or periodically (weekly or monthly) to clear cache memory.
You can try App42 PaaS, which monitors your apps continuously to make sure that they are up and running. If any container is found in a crashed state, the Health Monitor tries to bring it back to a working state; if it can't, that particular container is deleted and replaced with a new one.
Disclaimer: I work for App42 PaaS.

Is there a hard limit on how long Azure role startup can take?

Suppose I include a rather long-running startup task in my Azure role - running for something like up to several minutes. What happens if the startup task runs "too long"?
I'm currently testing on Compute Emulator and observe the following.
I have a 450 MB .zip file together with Info-ZIP unzip. The startup task unzips the archive. Deployment starts and I look into Task Manager. Numerous service processes start, then unzip.exe runs. After about two minutes all those processes stop, then start anew, and unzip.exe starts again.
So it looks like a deployment is allowed to run for about two minutes, then is forcefully reset and started again.
Is this the expected behavior? Does it persist on real cloud? Are there any hard limits on how long a role startup can take? How do I address this situation except moving the unpacking into RoleEntryPoint.OnStart()?
I had the same question, so I tried an experiment. I ran a Startup Task - taskType="simple" so that it would block the role from beginning to execute - and let it run for 50 hours. The Fabric Controller did not complain and the portal did not show any error. It finished its long "do nothing" loop after the 50 hours were up, then the Startup Task exited, and my Web Role started up fine.
So my empirical test says Startup Tasks can take a long time! At least 50 hours.
This should inform the load balancer that your process is still busy:
http://msdn.microsoft.com/en-us/library/microsoft.windowsazure.serviceruntime.roleinstancestatuscheckeventargs.setbusy.aspx
I have run startup tasks that run for a pretty long time (think 20-30 mins) and the role is simply in a 'Busy' state. I don't think there is a hard limit for how long the role will stay in that state as long as the Startup task is still executing and did not exit with a non-zero return code (in fact, this is a gotcha for most first time startup task creators when they pop a prompt). The FC is technically still running just fine, so there would be no reason to 'recover' the role (i.e. heartbeats are still going).
The dev emulator just notices when the role hasn't started and warns you. If you click the 'keep waiting' option, it will continue to run the Startup task to completion. The cloud does not do this of course (warn you).
Never tried a task that ran super long, so there might be a very long limit. I seem to recall 3 hrs was a magic number in some timeout cases like role recycles, but I have never tried...
There are some heartbeats that the Azure Fabric Agent will do against the role. If these are not acknowledged (say a long-running blocking process), this could cause the role to be flagged as unavailable.
You might try putting your startup process into a background thread that runs independently. This should help keep the role from being recycled while the process is starting up. Just keep in mind you may need to make some adjustments if you get requests before the role has fully started up. There's also a way (that I can't seem to recall ATM) to flag the role and take it out of the load balancer temporarily while your process completes.
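The real implementation would be C# inside RoleEntryPoint.OnStart, but the shape of the idea is language-agnostic: return from startup quickly, run the heavy work on a background thread, and expose a "busy" flag that request handling or health reporting can consult. A generic illustration in Python (not Azure SDK code; the names are placeholders):

    # Generic illustration of the pattern: don't block the startup/heartbeat
    # path, do the heavy setup on a background thread, and report "busy"
    # until it finishes.
    import threading
    import time

    startup_done = threading.Event()

    def heavy_startup_work():
        # placeholder for the long-running setup (e.g. unpacking a large archive)
        time.sleep(120)
        startup_done.set()

    def on_start():
        # kick off the work without blocking startup
        threading.Thread(target=heavy_startup_work, daemon=True).start()
        return True  # report "started" immediately

    def handle_request():
        if not startup_done.is_set():
            return "503 busy: still warming up"  # or flag Busy to the load balancer
        return "200 OK"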
