CoreOS, Fleet and Etcd2 fault tolerance - amazon-ec2

I have a 23 node cluster running CoreOS Stable 681.2.0 on AWS across 4 availability zones. All nodes are running etcd2 and flannel. Of the 23 nodes, 8 are dedicated etcd2 nodes, the rest are specifically designated as etcd2 proxies.
Scheduled to the cluster are 3 NGINX Plus containers, a private Docker registry, SkyDNS, and 4 of our application containers. The application containers register themselves with etcd2 and the NGINX containers pick up any changes, render the necessary files, and finally reload.
This all works perfectly, until a single etcd2 node is unavailable for any reason.
If the cluster of voting etcd2 members loses connectivity to even a single other voting etcd2 member, all of the services scheduled to the fleet become unstable. Scheduled services begin stopping and starting without my intervention.
As a test, I began stopping the EC2 instances which host voting etcd2 nodes until quorum was lost. After the first etcd2 node was stopped, the symptoms described above began. After a second node was stopped, the services remained unstable, with no further observable change. Then, after the third was stopped, quorum was lost and all units were unscheduled. I then started all three etcd2 nodes again, and within 60 seconds the cluster had returned to a stable state.
Subsequent tests yield identical results.
Am I hitting a known bug in etcd2, fleet or CoreOS?
Is there a setting I can modify to keep units scheduled onto a node even if etcd is unavailable for any reason?

I've experienced the same thing. In my case, running one specific unit caused everything to blow up: units that were scheduled and running perfectly fine were suddenly lost without any notice, and machines even dropped out of the cluster.
I'm still not sure what the exact problem was, but I think it might have had something to do with etcd vs. etcd2. I had a dependency on etcd.service in the unit file, which (I think, not sure) caused CoreOS to try to start etcd.service while etcd2.service was already running. This might have caused the conflict in my case and messed up the etcd registry of units and machines.
Something similar might be happening to you, so I suggest you check on each host whether you're running etcd or etcd2, and check your unit files to see which one they depend on.
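As a rough illustration (a hypothetical fleet unit, not taken from the question), a unit whose workload talks to etcd2 would declare its dependency on etcd2.service rather than etcd.service:

    [Unit]
    Description=Example application container (hypothetical)
    # Depend on etcd2.service, not etcd.service, so CoreOS is never asked to start the old daemon
    Requires=etcd2.service
    After=etcd2.service docker.service

    [Service]
    ExecStart=/usr/bin/docker run --name example-app example/app
    ExecStop=/usr/bin/docker stop example-app

On each host, "systemctl status etcd.service etcd2.service" will show which of the two daemons is actually active.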

Related

What to do when ECS-agent is disconnected?

I have an issue where, from time to time, one of the EC2 instances within my cluster has its ECS agent disconnected. This silently removes the EC2 instance from the cluster (i.e. it is no longer eligible to run any services) and silently drains my cluster of serving capacity. I have my cluster backed by an Auto Scaling group that spawns servers to keep the healthy count up, but the servers whose ECS agent is disconnected are not marked as unhealthy, so the ASG thinks everything is alright.
I have the feeling there must be an (easy) way to mitigate this; otherwise I have a big problem with having chosen ECS and using it in production.
We had this issue for a long time. With each new AWS ECS-optimized AMI it got better, but as of 3 months ago it still happened from time to time. As mcheshier mentioned, make sure to always use the latest AMI, or at least the latest AWS ECS agent.
The only way we were able to resolve it was through:
Timed autoscale rotations
We would try to prevent it by scaling up and down at random times
Good CloudWatch alerts
We happened to have our application set up as a bunch of microservices that were all queue (SQS) based, so we could scale up and down based on queues. We had decent monitoring set up that let us approximate the rate of each queue across the number of ECS containers. When we detected that the rate was off, we would rotate that whole ECS instance. I.e., say our cluster deployed 4 running containers of worker-1, and we approximate that each worker does 1000 messages per 5 minutes. If our queue rate was 3000 per 5 minutes and we had 4 workers, then 1 was not working as expected. We had some scripts set up in Lambda to find the faulty one and terminate the entire instance that ran that container.
I hope this helps, I realize it's specific to our in-house application, but the advice I can give you and anyone else is to take the initiative and put as many metrics out there as you can. This will let you do some neat analytics and look for kinks in the system, this being one of them.
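For what it's worth, here is a rough sketch of the kind of rate check described above (not the actual Lambda scripts; the queue name, cluster and service names, and the 1000-messages-per-5-minutes figure are placeholders, and the "find the faulty container instance and terminate it" step is omitted since that part was in-house):

    # Hypothetical check: compare the SQS throughput of the last 5 minutes with
    # what the currently running ECS workers should be able to achieve.
    import datetime
    import boto3

    QUEUE_NAME = "worker-1-queue"      # placeholder
    CLUSTER = "my-cluster"             # placeholder
    SERVICE = "worker-1"               # placeholder
    PER_WORKER = 1000                  # expected messages per worker per 5 minutes

    cloudwatch = boto3.client("cloudwatch")
    ecs = boto3.client("ecs")

    def check_worker_throughput():
        end = datetime.datetime.utcnow()
        start = end - datetime.timedelta(minutes=5)

        # Messages actually consumed from the queue in the last 5 minutes.
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/SQS",
            MetricName="NumberOfMessagesDeleted",
            Dimensions=[{"Name": "QueueName", "Value": QUEUE_NAME}],
            StartTime=start,
            EndTime=end,
            Period=300,
            Statistics=["Sum"],
        )
        consumed = sum(point["Sum"] for point in stats["Datapoints"])

        # Number of worker containers ECS believes are running.
        service = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])["services"][0]
        running = service["runningCount"]

        expected = running * PER_WORKER
        # Short by roughly one worker's share: assume one container is stuck.
        if running > 0 and expected - consumed >= PER_WORKER:
            print("consumed %d of expected %d with %d workers: likely a stuck container"
                  % (consumed, expected, running))

The point is not the exact thresholds but, as said above, having enough metrics flowing that a check like this is even possible.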

What keeps the cluster resource manager running?

I would like to use Apache Marathon to manage resources in a clustered product. Mesos and Marathon solve some of the "cluster resource manager" problems for additional components that need to be kept running with HA, failover, etc.
However, there are a number of services that need to be kept running to keep Mesos and Marathon themselves running (like ZooKeeper, Mesos itself, etc.). What can we use to keep those services running with HA, failover, etc.?
It seems like solving this across a cluster (managing how many instances of ZooKeeper, etc. there are, where they run, and how they fail over) is exactly the problem that Mesos/Marathon are trying to solve.
As the Mesos HA doc explains, you can start multiple Mesos masters and let ZK elect the leader. Then if your leading master fails, you still have at least 2 left to handle things. It is common to use something like systemd to automatically restart the mesos-master on the same host if it's still healthy, or something like Amazon AutoScalingGroups to ensure you always have 3 master machines even if a host dies.
The same can be done for Marathon in its HA mode (on by default if you start multiple instances pointing to the same znode). Many users start these on the same 3 nodes as their Mesos masters, using systemd to restart failed Marathon services, and the same ASG to ensure there are 3 Mesos/Marathon master nodes.
These same 3 nodes are often configured to be the ZK quorum as well, so there are only 3 nodes you have to manage for all these services running outside of Mesos.
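As a minimal sketch of the systemd piece of that setup (the binary path, ZooKeeper connection string, quorum size, and work directory are assumptions, not something prescribed by the answer):

    [Unit]
    Description=Mesos master (hypothetical example)
    After=network.target

    [Service]
    # --quorum should be a majority of the masters, e.g. 2 for a 3-master setup
    ExecStart=/usr/sbin/mesos-master \
        --zk=zk://10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181/mesos \
        --quorum=2 \
        --work_dir=/var/lib/mesos
    Restart=always
    RestartSec=10

    [Install]
    WantedBy=multi-user.target

Restart=always covers the "restart on the same host" case, while the ASG covers replacing the machine itself.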
Conceivably, you could bootstrap both Mesos-master and Marathon into the cluster as Marathon/Mesos tasks. Spin up a single Mesos+Marathon master to get the cluster started, then create a Mesos-master app in Marathon to launch 2-3 masters as Mesos tasks, and a Marathon-master app in Marathon to launch a couple of HA Marathon instances (as Mesos tasks). Once those are healthy, you can kill the original standalone Mesos/Marathon master and the cluster would failover to the self-hosted Mesos and Marathon masters, which would be automatically restarted elsewhere on the cluster if they failed. Maybe this would work with ZK too. You'd probably need something like Mesos-DNS and/or ELB to let other services find Mesos/Marathon. I doubt anybody's running Mesos this way, but it's crazy enough it just might work!
In order to understand this, I suggest you spend a few minutes reading up on the architecture and the HA part in the official Mesos doc. There, it is clearly explained how HA/failover in Mesos core is handled (which is, BTW, nothing magic—many systems I know of use pretty much exactly this model, incl. HBase, Storm, Kafka, etc.).
Also note that, naturally, the challenge of keeping a handful of Mesos masters/ZooKeeper nodes alive is not directly comparable with keeping potentially tens of thousands of processes across a cluster alive, evicting them, or failing them over (in terms of fan-out, memory footprint, throughput, etc.).

How to set up Percona Xtradb cluster with Amazon AutoScaling?

I want to make a cluster of 3 Percona XtraDB + application servers in EC2 using Auto Scaling groups, so that if a server fails for some reason it can be shut down, and the ASG will then start a replacement server that gets all of the current data from the other 2 working servers.
To implement this, I've made 3 instances (A, B, and C). On initial startup, instance A tests port 4567 on instances B and C; if that port is open on either of them, XtraDB is started with the proper wsrep_cluster_address settings and an SST is fetched from the running instance.
If that port is closed on both instances, A starts with wsrep_cluster_address=gcomm:// so that it becomes the "origin" of the cluster, assuming instances B and C simply haven't been started yet and will connect later.
The problem is, if instances B and C are running, but A can't connect to them on launch, "split brain" is going to occur. How do I avoid this situation?
If A cannot talk to B and C when A starts up, then A will bootstrap. You won't really have split brain. You would have two separate clusters. You would have existing data on B/C and no data on A.
You probably need service discovery, something like Consul or etcd, to function as the "source of truth" for the status of your cluster in the automated fashion you are trying to achieve. On startup, each node contacts Consul (or etcd) and looks for a key/value pair representing any existing nodes. If there are none, it bootstraps and then registers itself with the discovery service. Each node, once online, should send a regular update to the discovery service saying "I'm still here".
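A minimal sketch of that startup decision, assuming an etcd v2 HTTP endpoint and a key prefix of /v2/keys/galera/nodes (the endpoint, key names, and TTL are placeholders, and wiring the result into the mysqld startup is left out):

    # Hypothetical startup check against etcd (v2 HTTP API): bootstrap only if no
    # other Galera node has registered itself, otherwise join the existing cluster.
    import socket
    import requests

    ETCD = "http://10.0.0.10:2379"       # placeholder etcd endpoint
    PREFIX = "/v2/keys/galera/nodes"
    TTL = 60                             # seconds; re-register regularly while alive

    def existing_nodes():
        resp = requests.get(ETCD + PREFIX)
        if resp.status_code == 404:      # key does not exist yet: nobody registered
            return []
        children = resp.json().get("node", {}).get("nodes", [])
        return [child["value"] for child in children]

    def register(my_ip):
        # Register under the hostname with a TTL so entries for dead nodes expire.
        requests.put("%s/%s" % (ETCD + PREFIX, socket.gethostname()),
                     data={"value": my_ip, "ttl": TTL})

    def cluster_address():
        peers = existing_nodes()
        if peers:
            return "gcomm://" + ",".join(peers)   # join the running cluster (SST follows)
        return "gcomm://"                          # nobody else is up: bootstrap

    if __name__ == "__main__":
        my_ip = socket.gethostbyname(socket.gethostname())
        print("wsrep_cluster_address =", cluster_address())
        register(my_ip)

The TTL-based registration is the "I'm still here" update mentioned above; without it, a long-dead node's entry would keep a freshly launched node from ever bootstrapping.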
The real problem occurs when all nodes go down and ASG has to rebuild all of them. Where does the data come from in this case? There would not be any. This is one of the biggest downsides to automated configurations like this. It would be better for you just to have proper monitoring for when a node goes offline so you can take smarter actions.

Node thinks that it is online when its network cable is unplugged. Pacemaker/Corosync

I am trying to cluster 2 computers together with Pacemaker/Corosync. The only resource they share is an ocf:heartbeat:IPaddr. This is the main problem:
Since there are only two nodes, failover will only occur if no-quorum-policy=ignore is set.
When the network cable is pulled from node A, corosync on node A binds to 127.0.0.1 and pacemaker believes that node A is still online and that node B is the one that is offline.
Pacemaker attempts to start the IPaddr resource on node A, but it fails to start because there is no network connection. Node B, on the other hand, recognizes that node A is offline, and if the IPaddr resource was running on node A it starts it on itself (node B) successfully.
However, since the resource failed to start on node A, node A enters a fatal state and has to be rebooted to rejoin the cluster. (You could restart some of the needed services instead.)
One workaround is to set start-failure-is-fatal="false", which makes node A keep trying to start the IPaddr resource until it succeeds. The problem with this is that once it succeeds you have an IP conflict between the two nodes until they re-cluster and one of them gives up the resource.
I am playing around with the idea of having a node attribute that mirrors cat /sys/class/net/eth0/carrier (which is 1 when the cable is connected and 0 when it is disconnected) and then having a location rule along the lines of "if connected == 0, don't start the resource here", but we'll see.
Any thoughts or ideas would be greatly appreciated.
After speaking with Andrew Beekhof (author of Pacemaker) and Digimer on the #linux-cluster channel of the freenode.net IRC network, I have learned that the actual cause behind this issue is that the cluster is not properly fenced.
Fencing, i.e. having stonith enabled, is absolutely essential to a successful high-availability cluster. The following page is a must-read on the subject:
Cluster Tutorial: Concept - Fencing
Many thanks to Digimer for providing this invaluable resource. The section on clustering answers this question; however, the entire article is beneficial.
Basically, fencing and STONITH (Shoot The Other Node In The Head) are mechanisms that a cluster uses to make sure that a down node is actually dead. It needs to do this to avoid shared-memory corruption and split-brain states (multiple nodes taking over shared resources), and, most importantly, to make sure that your cluster does not get stuck in recovery or crash.
If you don't have stonith/fencing configured and enabled in your cluster environment you really need it.
Other issues to look out for are Stonith Deathmatch, and Fencing Loops.
In short, the issue of loss of network connectivity causing split brain was solved by creating our own STONITH device and writing a stonith agent following the /usr/share/doc/cluster-glue/stonith/README.external tutorial, and then writing a startup script that checks whether the node is able to join the cluster and either starts corosync or waits 5 minutes and checks again.
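A very rough sketch of what such a startup gate could look like (the carrier check, the peer address, and the service name are assumptions; the actual script from the answer is not shown here):

    # Hypothetical startup gate: only start corosync once this node can plausibly
    # join the cluster (link is up and the peer answers); otherwise wait and retry.
    import subprocess
    import time

    PEER = "192.168.1.101"     # placeholder address of the other cluster node
    IFACE = "eth0"             # placeholder interface used for cluster traffic
    RETRY_SECONDS = 300        # "waits 5 minutes and checks again"

    def link_up():
        try:
            with open("/sys/class/net/%s/carrier" % IFACE) as f:
                return f.read().strip() == "1"
        except OSError:        # reading carrier fails if the interface is down
            return False

    def peer_reachable():
        # One ping with a short timeout; return code 0 means the peer answered.
        return subprocess.call(["ping", "-c", "1", "-W", "2", PEER]) == 0

    while True:
        if link_up() and peer_reachable():
            subprocess.call(["systemctl", "start", "corosync"])
            break
        time.sleep(RETRY_SECONDS)

Note that this only gates whether corosync is started at boot; once the cluster is running, it is still the fencing described above that protects it.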
According to your configuration, the heartbeat between the two nodes will use 127.0.0.1, which I think is totally wrong.
Usually corosync needs to bind to a private IP, and the IPaddr resource should use a different IP, the traffic IP.
For example:
Node A: 192.168.1.100 (heartbeat); 10.0.0.1 (traffic IP)
Node B: 192.168.1.101 (heartbeat); 10.0.0.2 (traffic IP)
If my understanding is correct, the IPaddr resource will start up a virtual IP based on the traffic IPs; let's assume it is 10.0.0.3.

AppFabric Redundancy

We just tested an AppFabric cluster of 2 servers where we removed the "lead" server. The second server times out on any request to it with the error:
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0017>:SubStatus<ES0006>:
There is a temporary failure. Please retry later.
(One or more specified Cache servers are unavailable, which could be caused by busy network or servers. Ensure that security permission has been granted for this client account on the cluster and that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Retry later.)
In practice this means that if one server in the cluster goes down, then they all go down. (Note: we are not using Windows clustering, only linking multiple AppFabric cache servers to each other.)
I need the cluster to continue operating even if a single server goes down. How do I do this?
(I realize this question borders on Server Fault territory, but IMHO developers should know this.)
You'll have to install the AppFabric cache on at least three lead servers for the cache to survive a single server crash. The docs state that the cluster will only go down if the "majority" of the lead servers go down, but in the fine print, they explain that 1 out of 2 constitutes a majority. I've verified that removing a server from a three lead-node cluster works as advertised.
This is a typical distributed-systems concept. For a read or write quorum to be possible in an ensemble, you need 2f + 1 servers in total, where f is the number of failing servers you want to tolerate. I think AppFabric, like any CP (as in the CAP theorem) consensus-based system, needs this for the cluster to keep working.
--Sai
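As a quick worked example of that arithmetic (standard majority-quorum math, not something specific to AppFabric's documentation):

    n = 2f + 1 \quad\Longleftrightarrow\quad f = \left\lfloor \frac{n - 1}{2} \right\rfloor

So with n = 2 lead hosts you can tolerate f = 0 failures, which matches the outage described in the question, while with n = 3 lead hosts you can tolerate f = 1, which is why the answer above recommends at least three.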
That's actually a problem with the AppFabric architecture, and it is rather confusing in terms of the "lead host" concept. The idea is that the majority of lead hosts should be running so that the cluster remains up. So if you had three servers, you'd have to have at least two lead hosts constantly communicating with each other and eating up server resources, and if both go down then the whole cluster fails. The alternative is a peer-to-peer architecture where all servers act as peers, meaning that even if two servers go down the cluster remains functioning with no application downtime. Try NCache:
http://www.alachisoft.com/ncache/
