Mesosphere Marathon restart task on all instances - performance

I am running a Mesos cluster on 3 instances, each running both mesos-master and mesos-slave. I believe the cluster is configured correctly and able to run a web app via Docker and Marathon on all three instances.
I set up Jenkins to perform a deployment to the cluster and, as the last step, to POST to the Marathon REST API to restart the app. However, this fails silently (it simply gets stuck at the deployment stage). If the app is running on only 2 instances, the restart goes smoothly. Does Marathon require one instance to be unoccupied to perform an app restart?
Am I missing something here?

Are there enough free resources in your cluster?
IIRC the default restart behavior will first start the new version and then scale down the old one (hence you need 2× the app's resources).
See Marathon Deployments for details, and the upgrade strategy section there.
Here is the relevant excerpt on the upgrade strategy:
upgradeStrategy
During an upgrade all instances of an application get replaced by a new version. The upgradeStrategy controls how Marathon stops old versions and launches new versions. It consists of two values:
minimumHealthCapacity (Optional. Default: 1.0) - a number between 0 and 1 that is multiplied with the instance count. This is the minimum number of healthy nodes that do not sacrifice overall application purpose. Marathon will make sure, during the upgrade process, that at any point of time this number of healthy instances are up.
maximumOverCapacity (Optional. Default: 1.0) - a number between 0 and 1 which is multiplied with the instance count. This is the maximum number of additional instances launched at any point of time during the upgrade process.
The default minimumHealthCapacity is 1, which means no old instance can be stopped before another healthy new version is deployed. A value of 0.5 means that during an upgrade half of the old version instances are stopped first to make space for the new version. A value of 0 means take all instances down immediately and replace with the new application.
The default maximumOverCapacity is 1, which means that all old and new instances can co-exist during the upgrade process. A value of 0.1 means that during the upgrade process 10% more capacity than usual may be used for old and new instances. A value of 0.0 means that even during the upgrade process no more capacity may be used for the new instances than usual. Only when an old version is stopped, a new instance can be deployed.
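In practice this means a restart of an app that already fills the cluster needs headroom for the extra instances. One workaround is to relax the app's upgradeStrategy so no over-capacity is required. A hedged sketch against the Marathon REST API (the host marathon.example.com:8080 and the app id /myapp are placeholders):

# Stop half the old instances first instead of starting extras:
curl -X PUT http://marathon.example.com:8080/v2/apps/myapp \
  -H 'Content-Type: application/json' \
  -d '{"upgradeStrategy": {"minimumHealthCapacity": 0.5, "maximumOverCapacity": 0}}'

# The restart Jenkins triggers goes through the same deployment machinery:
curl -X POST http://marathon.example.com:8080/v2/apps/myapp/restart

With minimumHealthCapacity 0.5 and maximumOverCapacity 0, Marathon stops half the old instances before launching new ones, so the restart fits in the existing capacity.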

Related

Hortonworks Data Platform: High load causes node restart

I have set up a Hadoop cluster with Hortonworks Data Platform 2.5. I'm using 1 master and 5 slave (worker) nodes.
Every few days one (or more) of my worker nodes gets a high load and seems to restart the whole CentOS operating system automatically. After the restart the Hadoop components don't run anymore and have to be restarted manually via the Ambari management UI.
[Screenshot: the "crashed" node, rebooted after the high load value ~4 hours ago]
[Screenshot: one of the other "healthy" worker nodes; all other workers show similar values]
The crashes alternate among the 5 worker nodes; the master node seems to run without problems.
What could cause this problem? Where are these high load values coming from?
This seems to be a kernel problem, as the log file (e.g. /var/spool/abrt/vmcore-127.0.0.1-2017-06-26-12:27:34/backtrace) says something like
Version: 3.10.0-327.el7.x86_64
BUG: unable to handle kernel NULL pointer dereference at 00000000000001a0
After running sudo yum update, the kernel version was:
[root@myhost ~]# uname -r
3.10.0-514.26.2.el7.x86_64
Since the operating system update, the problem hasn't occurred anymore. I will keep observing the issue and give feedback if necessary.
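For reference, the fix boiled down to updating and rebooting, then verifying the running kernel (a minimal sketch; yum update pulls in the newer kernel among other packages):

sudo yum update -y
sudo reboot
# after the reboot:
uname -r   # should now report 3.10.0-514.26.2.el7.x86_64 or newer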

How to replace ECS cluster instances without downtime or reduced redundancy?

I currently have a try-out environment with ~16 services divided over 4 micro-instances. Instances are managed by an autoscaling group (ASG). When I need to update the AMI of my cluster instances, currently I do:
1. Create a new launch config and edit the ASG to use it.
2. Detach all instances, with the replacement option, from the ASG and wait until the new ones are listed in the cluster instance list.
3. MANUALLY find and deregister the old instances from the ECS cluster (very tricky).
4. Now the services are killed by ECS due to the deregistered instances :(
5. Wait 3 minutes until the services are restarted on the new instances.
6. MANUALLY find the old EC2 instances in the EC2 instance list and terminate them (being very, very careful not to terminate the new ones).
With this approach I have about 3 minutes of downtime, and I shiver at the idea of doing this in production environments. Is there a way to do this without downtime while keeping the overall number of instances the same (so without 200% scaling settings etc.)?
You can update the Launch Configuration with the new AMI and then assign it to the ASG. Make sure to include the following in the user-data section:
echo ECS_CLUSTER=your_cluster_name >> /etc/ecs/ecs.config
Then terminate one instance at a time, and wait until the new one is up and automatically registered before terminating the next.
This could be scripted and automated too; a rough sketch follows.
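A hedged sketch of such a script, assuming a configured AWS CLI; CLUSTER and ASG are placeholder names:

#!/bin/bash
set -euo pipefail
CLUSTER=your_cluster_name
ASG=your_asg_name

# Number of container instances the cluster should return to after each swap.
EXPECTED=$(aws ecs describe-clusters --clusters "$CLUSTER" \
  --query 'clusters[0].registeredContainerInstancesCount' --output text)

for id in $(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$ASG" \
    --query 'AutoScalingGroups[0].Instances[].InstanceId' --output text); do
  # Terminate without shrinking the group, so the ASG launches a replacement.
  aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id "$id" --no-should-decrement-desired-capacity
  # Give the old instance time to drop out of the cluster, then wait
  # until the replacement has registered before touching the next one.
  sleep 60
  until [ "$(aws ecs describe-clusters --clusters "$CLUSTER" \
      --query 'clusters[0].registeredContainerInstancesCount' \
      --output text)" -ge "$EXPECTED" ]; do
    sleep 15
  done
done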

Elasticsearch cluster data migration to new cluster

We have an Elasticsearch cluster running Elasticsearch 1.4 and Logstash 1.4, with 1 master and 4 data nodes. Now I want to upgrade Elasticsearch to 1.7 and Logstash to 1.5 without losing any data. My plan is to create a new cluster with new nodes and restore a snapshot of the current cluster onto it. My question: is this the best way, or should I upgrade the versions on the current cluster in place? I am a bit nervous because it is a production logging stack that is working smoothly. I don't want to mess around with the production cluster for testing.
First of all, read the documentation. As you said, you'd like to upgrade from 1.4 to 1.7, so there's no major version jump.
The documentation states that to go from one 1.x version to another 1.x version you do a rolling upgrade. What's that? Quoting the documentation:
A rolling upgrade allows the ES cluster to be upgraded one node at a time, with no observable downtime for end users.
Which means you can shut nodes down one by one, upgrade their binaries, and turn them back on. One node at a time!
Of course, always do a backup in case **** happens.
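A rough sketch of the per-node cycle for a 1.x cluster (the service name and package manager are assumptions; adjust to your installation):

# 1. Disable shard allocation so the cluster doesn't start rebalancing
#    while the node is down.
curl -XPUT 'localhost:9200/_cluster/settings' \
  -d '{"transient": {"cluster.routing.allocation.enable": "none"}}'

# 2. Stop the node, upgrade its binaries, start it again.
sudo service elasticsearch stop
sudo yum update elasticsearch   # or your package manager of choice
sudo service elasticsearch start

# 3. Re-enable allocation and wait for green before the next node.
curl -XPUT 'localhost:9200/_cluster/settings' \
  -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'
curl 'localhost:9200/_cluster/health?wait_for_status=green&timeout=10m'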

Jenkins not starting any more EC2 slaves when using labels

I am using Jenkins v1.564 with the Amazon EC2 Plugin and have set up two AMIs. The first AMI has the label small and the second has the label large. Both AMIs have the Usage setting set to Utilize this node as much as possible.
Now I have created two jobs. The first job has Restrict where this project can be run set to small; the second, similarly, is set to large.
So then I trigger a build of the first job. No slaves were previously running, so the plugin fires up a small slave. I then trigger a build of the second job, and it waits endlessly for the slave with the message All nodes of label `large' are offline.
I would have expected the plugin to fire up a large node since no nodes of that label are running. Clearly I'm misunderstanding something. I have gone over the plugin documentation but clearly I'm not getting it.
Any feedback or pointers to documentation that explains this would be much appreciated.
Are the two machine configurations using the same image? If so, you're probably running into this: https://issues.jenkins-ci.org/browse/JENKINS-19845
The EC2 plugin counts the number of instances based on the AMI image ID, so two slave templates that share the same image are counted against the same instance cap.
I found there's an Instance Cap setting in Manage Jenkins -> Configure System, under Advanced for the EC2 module, which limits how many instances can be launched by the plug-in at any one time. It was set to 2. Still odd, as I only had one instance running and it wasn't starting another one (so maybe the limit is "less than"). Anyway, increasing the cap to a higher number made the instance fire up.

Master node needs to execute mpdboot every time after a fresh restart of the system

I have to explicitly boot all the processor nodes every time my master system restarts, i.e.:
mpdboot -n 3
P.S. I am implementing a Beowulf cluster.
Try updating to a newer version of MPICH. MPD is quite old and hasn't been supported for some time. Newer versions use Hydra, which doesn't require booting.
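For illustration, with Hydra there is no daemon ring to boot; mpiexec launches the ranks directly over ssh. A minimal sketch (the machine file hosts and the binary name are placeholders):

# hosts lists the three node names, one per line.
mpiexec -f hosts -n 3 ./my_mpi_app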
